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ABSTRACT 


'a 

A  model  is  proposed  for  binary  time  series  with  marginal 
probabilities  given  by  logistic  regression  on  explanatory  variables,  by  analogy  with 
the  first  order  autoregressive  error  model  for  least  squares  regression. 
Measurements  at  adjacent  time  points  are  assumed  to  have  an  odds  ratio  that  is  not 
equal  to  one  and  that  is  constant  as  a  function  of  time.  Measurements  separated  in 
time  are  assumed  to  be  conditionally  independent  given  an  intervening 
observation. 


Consequences  of  using  an  ordinary  logistic  model  in  the  presence  of 
serial  dependence  are  explored.  The  closest  logistic  model,  defined  as  the  one  with 
the  minimum  Kullback-Leibler  distance,  is  shown  to  be  the  one  with  the  same 
marginal  probabilities.  Consistency  of  the  maximum  likelihood  estimator  of  the 
serial  dependence  model  is  proved  under  certain  conditions,  and  a  procedure  for 
finding  these  estimates  is  given. 

Properties  of  the  model  are  found,  including  expressions  for  the 
joint  probabilities  and  the  odds  ratio  between  observations  separated  in  time.  The 
model  is  shown  to  generate  *-mixing  processes. 

A  score  test  is  derived  in  order  to  test  for  independence  after 
performing  an  ordinary  logistic  regression,  and  properties  of  this  test  are  explored. 
The  effects  of  missing  data  on  the  score  test  and  on  estimation  of  the  odds  ratio 
(with  known  coefficients)  are  presented. 

The  model  is  applied  to  the  problem  of  automatic  classification  of 
EKG  data  based  on  feature  extraction.  A  positive  serial  dependence  is  found  in  the 
examples  presented. 
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Chapter  1 
Introduction 


1.1  The  aerial  dependence  model 

Logistic  regression  is  a  comon  procedure  for  modeling  a  binary  vector  Y 
when  there  are  explanatory  variables.  Under  this  model 


P[Y  =1] 

l0g  P[Yt=0“]  =  XtP 


where  X  is  the  vector  of  explanatory  variables.  This  model  assumes  the 
observations  on  Y.  given  X,  are  independent.  If  Y  is  a  time  series,  how¬ 
ever,  the  independence  assumption  may  not  be  realistic. 


A  similar  problem  can  occur  in  ordinary  least  squares.  If 

Tt  -  x;p  +  s  • 

it  may  be  reasonable  to  assume  the  sequence  (et)  is  serially  correlated. 
The  simplest  model  for  a  serially  correlated  series  {etJ  is  the  first  or¬ 
der  autoregressive  model,  with 

«t  =  Pet-1  +  °t 

for  some  p  less  than  1  in  absolute  value  and  for  a  sequence  of  independent 
normally  distributed  (ut)  with  common  mean  0  and  unknown  variance  a*. 


The  parameter  p  measures  the  correlation  between  successive  values  of  sti 
this  is  a  natural  parameter  to  measure  dependence  between  normal  random 
variables.  For  binary  variables,  however,  the  odds  ratio  is  in  many  re- 
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spects  the  natural  parameter  for  measuring  dependence. 


This  reasoning  leads  to  the  following  model  for  serial  dependence  in 
logistic  regression: 

P[Yt=l] 

log  P[Yt=0l  =  XtP  ’ 


[1.1] 


P[Yt=Yt_1=l]  P[Yt=Ytl=0] 
P[Yt=l , Yt_1=0]  P[Yt-0,Yt_1-l] 


for  all  t. 


[1.2] 


and  Yr  and  Yfc  are  conditionally  independent  given  Y#  if  r<s<t.  (All 
probabilities  are  conditional  on  the  (Xt)  sequence.)  In  the  remainder  of 
this  paper  I  will  refer  to  this  model  as  the  "serial  dependence  model." 

1.2  Related  models 

Other  models  for  binary  time  series  have  been  proposed.  For  series  with¬ 
out  covariates.  Billingsley  (1961)  considers  stationary  Markov  processes. 
Keenan  (1982)  considers  processes  whose  marginal  probabilities  are  func¬ 
tions  of  an  underlying  stationary  process.  Kedem  (1980)  also  considers 
stationary  binary  time  series. 


Logistic  regression  models  for  variables  measured  over  time  have  been  used 
in  studies  of  "panel”  or  "longitudinal"  data,  in  which  repeated  binary 
measurements  are  taken  on  a  large  number  of  subjects.  Typically  each  in¬ 
dividual  time  series  is  short,  and  any  asymptotic  theory  that  is  developed 
holds  as  the  number  of  subjects  approaches  infinity  while  the  length  of 
each  series  remains  finite.  Korn  and  Vhittemore  (1979)  consider  a  model 


,*■ 


in  which  the  conditional  probability  of  {1^=1}  is  given  by  logistic  re¬ 
gression  on  the  covariates  and  on  Yt_^.  (In  models  such  as  these  Yj  must 
be  treated  as  a  special  case,  and  Korn  and  Thittemore  assume  the  existence 
of  Yg=Yfl,  where  n  is  the  length  of  the  series.)  Most  other  models  also 
use  a  logistic  function  for  the  conditional  probabilities. 

Zeger  et  al.  (1985)  also  consider  longitudinal  data,  but  their  model  is 
similar  to  the  serial  dependence  model  in  that  they  model  the  marginal 
probabilities  as  logistic  functions  of  the  covariates.  Apart  from  their 
application  to  longitudinal  data,  their  model  differs  from  the  serial  de¬ 
pendence  model  in  two  respects.  First,  they  use  the  correlation  between 
adjacent  binary  responses  as  their  measure  of  association,  and  they  assume 
it  is  constant.  Second,  their  covariates  are  functions  of  the  subject 
only  and  so  do  not  vary  with  time. 

The  serial  dependence  model  could  be  used  for  panel  data,  and  many  of  the 
other  models  in  the  literature  could  be  applied  to  a  single  long  binary 
time  series  with  time-varying  covariates.  However  my  motivation  in  pro¬ 
posing  this  model  is  the  automatic  EKG  classification  example  in  Chapter 
8,  so  in  this  paper  I  will  consider  only  a  single  long  binary  time  series. 

1.3  The  odds  ratio  as  a  measure  of  association 

Many  measures  of  association  are  possible  for  binary  random  variables,  but 
there  are  some  desirable  properties  possessed  only  by  those  measures  that 
are  functions  of  the  odds  ratio.  In  this  section  I  will  discuss  the  pos¬ 
sibility  that  other  measures  may  be  useful. 


The  analogy  between  the  serial  dependence  model  in  logistic  regression  and 
the  autoregressive  error  model  in  linear  regression  breaks  down  in  the 
case  of  perfect  association.  In  linear  regression,  if  the  correlation 
parameter  is  -1  and  the  coefficients  are  known,  then  knowledge  of  Yt 
provides  perfect  knowledge  of  T#  for  all  s>t,  since  letl  is  then  a 
constant  for  all  t. 

For  binary  variables  an  odds  ratio  of  0  or  infinity  does  not  provide  an 
analogous  property.  Consider  the  following  pair  of  two-by-two  tables  of 
joint  probabilities  of  (A.B)  and  (C.D): 

1  A  0  1  D  0 

R  1  .6  .2  .8  p  1  “Tfi  IF"!  .6 

0  .0  .2  .2  0  .0  .4  .4 

.6  .4  .6  .4 

In  both  tables  the  odds  ratio  is  infinite,  but  only  for  the  pair  (C,D) 

does  knowledge  of  one  member  of  the  pair  provide  perfect  knowledge  of  the 

other  member.  This  happens  only  when  the  two  random  variables  have  the 

same  marginal  probabilities. 

This  feature  of  the  odds  ratio  was  observed  by  Feinberg  (1981).  He  dis¬ 
tinguishes  between  "complete*  association,  as  in  the  (A,B)  pair,  and  "ab¬ 
solute"  association,  as  in  the  (C.D)  pair.  Since  the  serial  dependence 
model  does  not  distinguish  between  the  two,  it  may  not  be  a  useful  model 
in  an  application  where  it  seems  necessary  for  "perfect"  association  to 
imply  "absolute”  association.  If  correlation  were  used  as  the  measure  of 
association,  this  implication  would  hold.  Correlation  was  used  by  Zeger 


et  al.  (1985). 


Let  the  event  be  called  a  'state  change.”  Examining  the  tables 

above  shows  that  in  the  serial  dependence  model  with  an  infinite  odds  ra¬ 
tio,  state  changes  are  possible  if  changing  state  would  avoid  moving  to  an 
event  of  smaller  marginal  probability.  For  example,  let  Yt_^  be  A  and  let 
Yt  be  B  in  the  above  table.  The  transition  from  {A=0}  to  {B=l)  avoids  the 
transition  from  {A=0}  to  {B=0}.  Since  P[A=0J  >  P[B=0] ,  a  state  change  is 
poss  ible . 

Data  from  a  variety  of  sampling  models  can  be  entered  in  a  two-by-two 
table.  For  example,  fixed  numbers  of  patients  might  be  assigned  to  a 
"treatment"  and  a  "control"  group,  and  then  might  be  classified  as  "im¬ 
proved*  or  "not  improved"  at  the  end  of  the  study.  One  advantage  of  the 
odds  ratio  is  that  it  is  invariant  to  row  and  column  multiplications,  so 
in  the  hypothetical  example  it  would  not  depend  on  the  number  of  patients 
assigned  to  each  group. 

The  above  tables  of  marginal  probabilities  suggest  a  sampling  model  in 
which  both  classifications  are  random,  and  the  row  and  column  totals  sum 
to  1.  This  is  the  case  in  the  serial  dependence  model.  In  sampling 
models  where  the  row  and  column  totals  are  not  arbitrary,  the  invariance 
property  of  the  odds  ratio  loses  some  of  its  importance. 

1.4  Notation 

I  will  use  the  following  notation  in  this  paper.  Relationships  between 
some  of  these  quantities  are  shown  in  Figure  1.1.  Note  that  throughout 
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this  paper  the  independent  variables  are  treated  as  given,  not  as  random 
variables. 


i 


Symbol 

Tt 

*t 

Pt 

nt(j) 

“t 

V 

P 

m  * 

¥'  P 


Meaning 

Dependent  variable  at  time  t 
Vector  of  covariates  at  time  t 
Prob[Yt=l]  (marginal  probability) 

Prob[Yt=l lYt_j=j]  (conditional  probability) 
Prob[Yt=Yt_i=l]  (joint  probability) 

Odds  ratio  between  successive  observations 
Vector  of  coefficients 

Maximum  likelihood  estimates  under  the  serial  dependence 
model 

Maximum  likelihood  estimate  of  p  under  the  ordinary 
logistic  model 


Figure  1.1.  Relationship  Between  Parameters 
of  the  Serial  Dependence  Model 


n(1) 


For  any  given  values  of  i|i,  pt,  and  pt-i,  the  corresponding  values  of  tt(  1 )  and 
n(0)  are  those  at  the  intersection  of  the  curve  of  constant  odds  ratio  deter¬ 
mined  by  Hi  and  the  line  determined  by  pt  and  pt- 1  -  The  quantity  pmin  is 
defined  in  Chapter  4. 
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Chapter  2 

Properties  of  a  Process  Generated  by  the  Model 

In  this  chapter  I  will  examine  some  properties  of  a  process  (Yt)  generated 
by  the  serial  dependence  model.  I  will  derive  the  joint  distribution  of 
two  observations  from  the  process  and  express  the  log  linear  representa¬ 
tion  of  this  joint  distribution  in  terms  of  the  quantities  obtained  from 
the  model,  namely  the  odds  ratio  and  the  marginal  probabilities.  I  will 
derive  an  expression  for  the  odds  ratio  between  Yt  and  Yt+n  for  n>l.  I 
will  prove  that  {Y(}  is  a  mixing  process.  Finally  I  will  illustrate  some 
of  these  properties  with  plots  of  the  odds  ratio  between  Yt  and  Yt+n  as  a 
function  of  n  and  f. 

2,1  The  joint  distribution  of  two  observations 

In  this  section  I  will  obtain  the  joint  distribution  of  two  observations 
from  a  process  generated  by  the  serial  dependence  model,  by  relating  the 
parameters  of  a  log  linear  representation  to  the  marginal  probabilities 
and  odds  ratio.  I  will  consider  consecutive  observations,  but  the  same 
results  apply  to  non-consecutive  observations  if  the  odds  ratio  between 
them  is  known.  (A  formula  for  the  odds  ratio  between  non-consecutive 
observations  is  given  in  the  following  section.) 

For  any  given  t  let  pt  =  P[Yt=l]  and  Pt_j  =  P[Yt_1=l],  and  let  f  be  the 
(constant)  odds  ratio  between  Yt  and  Yt_j .  Let  p^j  =  P[Yt_2=i,Yt=j]  be 


decomposed  as 


log  pAj  -  u  +  u1(i)  +  u2(j)  +  ui2( i j ) 
with  the  usual  constraints 

]  *1U)  ’  f  ‘2(J)  m  £  "l2(ij>  *  f  "l2Uj>  * 

First  I  will  express  the  quantities  obtained  from  the  aerial  dependence 
model  as  functions  of  the  parameters  of  the  log  linear  representation. 

The  marginal  probabilities  can  be  written  as 

"  «P<»t»2<i))  ["P<»1(1)«12<11>>  *  •XP<»1(0)+»12(01),) 

-  [•*p<»1(1)+«12(11))  ♦  "p^iur-wai))1 

-  2  cxp<«-H.J(1))  posh  <»1(1)+«12(11)>. 

Similarly 

Pt_i  =  2  exp(u+u1(1))  cosh  (n2(i)+u12(ii) ) . 

The  odds  ratio  satisfies  the  equation 

log  y>  =  log  pn  +  log  pQ0  -  log  p1Q  -  log  pQ1 

(u+ul(l)+u2(l)+n12(ll))  +  (u+ul(0)+u2(0)+n12(00)) 

-  <«*»1(1)*»2(0)-»12(10))  -  ^o+ttl(0)+,l2(l)+tt12(01) ^ 

“  u12 ( 11  )+u12 (00)~°12(10)_o12(01)  "  4  °12(11)  ’ 

The  log  linear  parameters  can  be  obtained  by  solving  three  equations, 
since  given  u2q),  a2(l)'  an<*  u12(ll)'  t*ic  other  parameters  are  determined 
by  the  four  constraints  given  above  and  the  constraint  p^ 


1.  The 


two-factor  term  is,  from  above,  Qi2(H)  =  (log  f)/4j  it  depends  only  on 
the  odds  ratio.  Unfortunately  the  single-factor  terms  depend  on  the  odds 
ratio  and  both  marginal  p.  habilities.  They  solve  the  pair  of  eqnations 
Pt  -2  expCn+Uj^j)  cosh  (ui(u+(l°* 

Pt_j  =  2  exptu+u^jj)  cosh  (n2(i)+(l°*  • 

The  parameter  u  can  be  removed  by  using  these  equations  to  obtain 


pt  11 

log  —  =  2o2(l)  +  lo*  eosh(u1(1)+  j-  log  p)-log  cosMu^jj-  f  log  f) 


and  a  similar  equation  for  log  [pt_k/ (l-pt_j) ] 


2.2  The  joint  distribution  of  three  observations 
For  some  r<s<t  let 


Pijk  =  PlYt=i'  Y8“j*  Vkl 

be  represented  by  a  log  linear  model: 

lo,  ,ljk  -  «  ♦  «1(i)  *  »2(J)  *  »J(l,  *  »12(lj)  +  "23(jk) 

*  u13(ik)  *  °123 ( i  jk)  ' 


with  the  u-terms  satisfying  the  constraints 

f  Ul(i)  =  *  U2( j)  =  l  U3(k)  =  l  °12(ij)  =  J  *12(ij) 

=  |  n23(jk)  "  l  U23(jk)  =  f  °13(ik)  =  l  U13(ik) 

=  f  °123 < ijk)  =  J  U123(ijk)  =  k  °123(ijk)  =  °’ 


[2.1] 


[2.2] 


(These  are  not  the  same  as  the  similarly  named  quantities  in  the  previous 
section.)  The  ujj  and  0^23  terms  are  0  because  Yr  and  Yt  are  independent 
given  Ys  (see  Fienberg,  1981,  page  33). 


Let  b*  the  odds  ratio  between  Tr  and  Y#  and  let  ^3  b®  the  odds  ratio 
between  Yg  and  Yt.  (If  the  Y's  are  consecutive  observations,  then  ~ 
V23  =  •  The  object  of  this  section  is  to  find  fi3'  the  odds  ratio 

between  Yr  and  Yt. 

Let  a  dot  subscript  denote  summation  over  the  corresponding  index,  so  for 
example  p —  =  ^  Pijh*  This  quantity  can  be  written  as 

'ij.  ’  ‘,<,(at"l(i)+”2(j)+“3(l)*°12(ij)*"23(jl)) 

*  «P<»*»,(i)-«2(j)«3<„,«,2(ij)*»23(jO)) 

‘  2  ®*P(n+ui(i)+112( j )+°12(ij ) ^  c°,i<"3(l>+u23(jl>>  • 

Summing  over  the  other  indices  gives 

h.k  *  2  ®xP^u+nj( ^  ®°*‘(«2(l)+*12(il)i«23Uk))  • 

» .jk  “  2  ®xP^tt+u2( j )+n3(k)+°23( jk) ^  00*1,<"l(l)t"l2(lJ)>  • 

Then 

Vl2  =  (Pll.P00.,/(Pl0.P01.)  ’ 

and 

log  Vl2  = 

log[exp(u+u1(1)+u2(1)+u12(11))(exp(u3(1)+u23(11))+exp(u3(0)+u23(10))) 
+log[exp(u+u1(0)+u2(0)+u12(00))(exp(u3(1)+u23(01))+exp(u3(0)+u23(00))) 
-log[exp(u+u1(1)+u2(0)+u12(10))(exp(u3(1)+u23(01))+exp(u3(0)+u23(0()))) 
-log[exp(u+u1(0)+u2(1)+u12(01))(exp(u3(1)+u23(11))+exp(u3(0)+u23(10))) 
This  can  be  simplified  by  using  the  constraintr  [2.2]  to  relate  the 
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u-terms  and  obtain  u2(2)  =  -ul(0)*  etc*  re»°lt  is 


lo«  »12  -  4  "12(11)  •  [2'31 
or  u12(ll)  =  (lo«  Vl2)/4*  Similarly  023(H)  =  (lo«  ?23)/4* 


The  remaining  odds  ratio  is  given  by 
log  fl3  = 

"+"l(l)*a3(l)*1°‘[‘Ip("2(l)*"l2(ll)+o23(U)H,II,<“2(0)+"l2(10)+"23(01)>) 
+o+"l(l)+"3(l)+3o3^exp^"2(l)+"l2(01)+B23(10) ^+e"p^o2(0)+"l2(00)+n23(00) ^ 
'"■"l(l)'“3(l)'1O«l'II,(“2(l)*"l2<ll)*"23(10))-t'Ip("2(0)+"l2(10)+"23<00))1 
■"■"l<l)""3(l)"1OgI'ip<"2(l)4"l2(01)+"23(ll)>*'II,<“2(0)'*"l2(00)*"23(01))1 


‘  10,12  00,h(“2(l)*"l2(ll)+"23(ll))1+l0,[2  co**,^B2(i)-,I12(ll)~"23(ll) 
-log (2  co,H(»2(1)»o12(11)-o23(11)))*log[2  oo.l.<»2(1)-n12(11)*u23(11))) 

cosh[u2(1)+(log  V12+log  |»23)/4]  cosh[u2(1)-(log  f12+log  f23)/41 


coshlu2(1)+(log  r12-lo8  f»23)/41  cosh[n2^2)+(log  f»12"log  *23)/4] 


This  yields  the  relation 

cosh  2o2^j  +  cosh[  (log  ?12+log  f23>/2  ] 

M  *  7  -  -  —  ....  1  1  .  . 


ao2(1)  *12  ,LO*  *23,,X  J 

*13  cosh  2u2(2)  +  cosh[  (log  f12_lo8  f23>/2  ^ 


[2.4] 


For  the  special  case  of  three  consecutive  observations,  y22  =  f23  =  f,  and 


cosh  2&2(i)  +  cosh  log  f 
*13  "  cosh  2Uj^)  +  1 


[2.5] 


These  expressions  can  be  need  to  calculate  the  dependence  between  observa¬ 
tions  separated  in  time.  Suppose  Yi,...,Yn  have  marginal  probabilities 
Pit .../pn  and  common  odds  ratio  f  betwen  Yt  and  Yt+j  for  all  t.  Then 
l>12=p.  and  for  any  t>2,  can  be  found  recursively  by  applying  the  above 
expression  with  marginal  probabilities  (Pi»Pt-l*Pt)  and  odds  ratios 

and  ft-l.t“** 

Use  of  these  formulas  requires  calculation  of  02(2)*  Th®  log  linear  rep¬ 
resentation  provides  the  relation 

u2(l)  "  (lo«  PlU  +  l0g  P010  _  log  p101  ~  log  P000)/4  * 

The  Markov  property  implies  “  P[Yr=i,Yg=j]  P[Yt=k|Yg0j]  * 

P[Yr=i,Y#=j]  P[Ys=j,Yt=k]  /  pg.  The  pairwise  joint  probabilities  can  be 
obtained  from  the  marginal  probabilities  and  o=P[Yr*i,Y#,»j] .  Equation 
[1.1]  provides  an  expression  for  o: 
a(l+a-p  -p  ) 

*12  (p  -a) (p  -a)  '  [2.6] 

I  t  s 

This  determines  a  uniquely,  because  the  quadratic 

|»(pr-o)  (pg-a)  -  a(l+a-pr-pg)  =  0 

i  has  only  one  root  in  the  interval  of  acceptable  a  values,  or 

max(pr,pg)  <  a  <  min(l,pr+p#) . 

This  can  be  seen  by  examining  Figure  1.1.  It  can  be  proved  by  noting  that 
the  right-hand-side  of  [2.6]  increases  from  0  at  a»max(px,pg)  to  infinity 
at  a«*min(l,pr+p#) .  It  therefore  takes  the  value  fj2  »n  odd  number  of 

! 


times.  A  quadratic  can  have  no  more  than  two  roots,  so  there  is  a  single 


root  in  this  interval. 

2.3  A  mixing  property 

There  are  various  mixing  properties  that  determine  how  dependence  in  a 
stochastic  process  dies  off  as  a  function  of  time.  The  strongest  is 
•-mixing . 

Using  the  expressions  for  the  odds  ratio  between  observations  separated  in 
time  it  is  possible  to  bound  the  dependence  between  distant  observations. 
Specifically,  I  will  show  that  a  process  generated  by  the  serial  depen¬ 
dence  model  is  a  *-mixing  process.  I  will  use  this  property  later  to 
establish  consistency  of  the  maximum  likelihood  estimator. 

Definition:  A  process  is  defined  as  a  *-mixing  process  (Hall  and  Heyde, 
1980,  page  40)  if  there  exist  a  number  N  and  a  function  f  defined  on  the 
positive  integers  such  that 

(1)  f(n)  is  non-increasing  in  n  for  n>Ni 

(2)  f(n)  approaches  0  as  n  approaches  infinity) 

(3)  for  all  t,  given  any  event  A  in  the  o-field  generated  by 
{Y^,..., Y^}  and  any  event  B  in  the  o-field  generated  by 
{Yt+n, . . . } , 

|P(AB)  -  P(A)  P(B) |  <  f(n)  P(A)  P(B), 
where  AB  is  the  intersection  of  A  and  B.  By  the  Markov  property,  it  is 
sufficient  to  consider  only  A={Yt«l}  or  (Yt®0)  and  B={Yt+n=l}  or  (Yt+n=0). 


Take  any  fixed  t  and  let  be  the  odds  ratio  between  Yt  and  Yt+n.  Let 

A={Yt=l}  and  B={Yt+n=l}.  Then 

_  P(AB) [1+P(AB)-P(A)-P(B) ] 

*n  [P(A)-P(AB)][P(B)-P(AB)1  ' 

.  _  P(AB)-P(A)P(B) ] 

tP(A)-P(AB)][P(B)-P(AB)l  ' 

|P(AB)-P(A)P(B) I  =  lfn-l|  [P(A)-P(AB) ]  [P(B)-P(AB)1 
<  P(A)  P(B) . 

The  other  combinations  of  A  and  B  can  be  treated  similarly,  and  two  of 
these  combinations  lead  to  the  bound  l(l/|>n)~ll.  Therefore 

|P(AB)-P(A)P(B) I  <  max{|fn-ll.l(l/fn)-l|}  P(A)  P(B) , 
so  it  remains  to  show  that  pn  approaches  1  uniformly  in  t.  I  will  show 
this  for  n=2 j  by  induction  on  j.  Then  I  will  extend  the  result  to  other 
values  of  n  by  showing  |^n+j-l|  -  lpn“ll  for  »H  “• 

Suppose  f>l  (the  proof  for  p<l  is  similar).  The  expressions  given  above 

for  show  that  if  p>l,  then  fn>l  for  all  n.  Therefore  0  is  a  lower 
bound  on  (fn~D  for  all  n  in  the  following  proof. 

For  each  t  there  is  a  quantity  u  (equal  to  °2(1)  in  f^e  1°*  linear  expan¬ 

sion  carried  out  above)  such  that 

cosh  2u  +  cosh  log  p 
^2  cosh  2u  +  1 


It  is  easy  to  see  that  if  a,  b,  and  c  are  positive  and  if  b>c,  then 
( a+b ) / ( a+c )  is  a  decreasing  function  of  at  its  derivative  with  respect  to 
a  is  (b-c ) / ( a+c ) * .  Since  cosh  x  2  1  for  all  x,  it  therefore  follows  that 
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for  all  t. 


1  +  cosh  log  «> 
2 


and  therefore  ({>2~  1)  <  <f-l)/4. 


Now  for  a  fixed  positive  integer  j,  let  n»2-J+*  and  k*2-^ .  and  suppose  for 
all  t 

(Vn/2_1)  - 

Fix  t  and  let  and  be  the  odds  ratios  for  the  pairs  ^t*^t+n/2^  *n<1 
(Yt+ny2»Yt+Q) ,  respectively.  Then  there  is  a  value  u  such  that 

cosh  2u  +  cosh  [  y  log  +  y  log  Fgl 

fn  =  - i - 1 -  • 

cosh  2u  +  cosh  [  y  log  ~  y  1°I  Fgl 

Then  for  all  t, 

|>n  i  y  +  y  COsh  [  y  log  (l+k(f-l))  +  y  log  (l+k(f~l))  ] 

=  y  +  l  [  (l+ktf-l))  +  (l+ktv-l))"1  ] 

<  I  +  I  (i+kCv-l))  , 

which  implies  g»n~l  -  k(f-l)/4  =  (g»-l)n-^.  Therefore  by  induction,  if  f(n) 
is  defined  as  (y-l)n~^,  then  (f>n-l)  i  f(n)  when  n  is  a  power  of  2. 


It  reaiains  to  show  that  f  can  be  suitably  defined  when  n  is  not  a  power  of 
2.  The  above  argument  shows  lim  inf  (fa~D  “  0,  so  it  is  sufficient  to 
show  that  lvn+i~ll  <  lfn-l|.  But  for  any  given  t  there  is  a  quantity  u 
such  that 


cosh  2u  +  cosh  [  y  log  f  +  y  log  y>] 
n+*  cosh  2u  +  cosh  [  y  log  Fn  ~  y  log  f>3 


■  +V  )  <Pn  ~Vn  >  *  Wn~  1'fn>  • 

which  is  strictly  positive  if  |>n>l.  (If  f>n=l,  pn+i=l*) 


Therefore  if  f  is  defined  as 


£(n)  -  W* 

.  f(n-l) 


if  a  is  a  power  of  2, 
otherwise. 


[2.7] 


then  f  satisfies  the  three  conditions  in  the  definition  of  a  '-nixing 
sequence.  Repeating  this  proof  for  the  case  qKl  gives  the  following 
proposition. 


Proposition:  The  serial  dependence  model  generates  '-nixing  sequences. 


2.4  Some  numerical  calculations 

Figures  2.1  through  2.6  show  the  log  of  the  odds  ratio  between  Yt  and  Yt+n 
for  linilO,  calculated  for  some  special  cases  using  the  formulas  derived 
in  this  chapter.  The  step  functions  are  obtained  from  the  upper  bound  f 
on  (fn~l)  derived  in  the  previous  section,  equation  [2.7].  For  each  curve 
the  marginal  probability  pt  takes  the  constant  value  0.5  for  all  t. 
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For  plots  with  f<l,  f  is  obtained  by  applying  [2.7]  to  the  process  {Zt }  = 
,1-Y2 .Y3 , I-Y4 , . . . ) i  each  even-numbered  term  is  changed.  If  the  origi¬ 
nal  process  has  odds  ratio  $>st  between  Yg  and  Yt,  the  odds  ratio  betwen  Zg 
and  Zt  is  if  s-t  is  even  and  l/f#t  if  s-t  is  odd.  Therefore  the  upper 
bound  obtained  from  f,  and  a  lower  bound  that  is  the  inverse  of  this  upper 
bound,  bound  the  odds  ratio  of  the  original  process. 


m 

i 


Chapter  3 


Consequences  of  Ignoring  Serial  Dependence 

In  linear  regression  with  serially  correlated  errors,  coefficients  esti¬ 
mated  by  ordinary  least  squares  are  consistent,  but  the  estimated  standard 
errors  are  not  correct.  In  this  chapter  I  will  show  that  the  same  holds 
for  the  serial  dependence  model.  I  will  give  some  theoretical  indication 
that  the  ordinary  logistic  coefficient  estimates  are  consistent  estimators 
of  the  coefficients  in  the  serial  dependence  model,  and  I  will  verify  this 
with  a  simulation.  I  will  also  perform  another  simulation  in  which  I  will 
show  that  confidence  intervals  computed  using  the  standard  errors  from  the 
logistic  model  do  not  have  the  correct  coverage  probabilities. 

3.1  Coefficient  estimates 

Suppose  a  process  is  generated  by  the  serial  dependence  model  with  unknown 
coefficients  £q.  I  will  show  that  the  nearest  ordinary  logistic  model  to 
the  serial  dependence  model,  in  the  sense  of  minimum  Kullback-Leibler  dis¬ 
tance,  is  the  one  with  the  same  coefficients.  Since  these  coefficients 
maximize  the  expected  values  (under  the  model  that  generated  the  process) 
of  log  of  the  ordinary  logistic  likelihood,  this  is  an  indication  that  the 
ordinary  logistic  coefficient  estimates  should  converge  to  the  true 
values. 


Let  f  be  the  density  function  in  the  true  model,  and  let  (pt)  and  {atJ  be 


31 


the  usual  marginal  and  joint  probabilities  in  that  model.  Let  g  be  the 
density  function  for  an  ordinary  logistic  model,  and  let  (qt)  be  the  mar¬ 
ginal  probabilities  in  that  model.  They  can  be  written 

log  g  «  log  PlTj^yjHlog  P[Y2=y2lTi=yi]  +  ...+log  HYn=yn lY^-y^] 

=  It  lyt  log  qt  +  (l-yt)  log  (l-qt)J 


log  f  =  y2  log  P2  +  (1-yj)  log  (1-Pi)  +  It>2  yt_iyt  lo*  (ot/Pt-l) 

+  (i-yt-i>yt  iog((pt-at)/(i-pt_1))  +  yt-i(1~yt)  i°*(Pt-i_at)/pt-i 
+  (l-yt_2) (l~yt)  log  ( (l+at-pt-pt_2)/ (l-pt_2) )  . 

Then  the  Kullback-Leibler  distance  is 


Kfg  =  Ef  [  log  (f/g)  ] 

P2  I"! 

-  log  —  .  <1-Pl)  lo,  -j- 


K  ,  “t 

)  a  log  - 

L  t  *  p  ,q 


t-l^t 


prat  pt-r°t 

+  (p  ~<*  )  log  ; —  +  (p,  i-®*)  log - rq - \ 

t  t  vl-ptl)  qt  t-1  t  •  pt-l'1_9t 

1+0t"pt-l_pt 

*  '““t-'t-r’t1  1o* 


[3.1] 


To  minimize  this  distance  it  is  helpful  to  collect  terms  and  write 
Kfg  “  A  -  pj  log  qx  -  (1-pj^)  log  (1-qj)  -  {It>2  at  log  qfc 

+  (pt~at)log  qt  +  (pt_1-at)log  (l-qt>  +  d+a^p^-p^log  (l-qt ) } 

=  A-pjlog  qj-d-PjJlogd^J-tl^j  ptlog  qt+(l-pt)log(l-qt)  J  , 

where  A  is  a  function  of  {ptJ  and  [at]  but  not  (qt).  The  derivative  with 
respect  to  qt  is 


which  is  0  if  and  only  if  pt  =  qt .  The  second  derivative  is  positive  for 
all  values  of  pt  and  q^  between  0  and  1,  so  the  Kullback-Leibler  distance 
is  a  minimum  if  qt=pt  for  all  t.  This  condition  is  satisfied  if  the  co¬ 
efficients  in  the  two  models  are  the  same,  so  the  following  proposition  is 
proved. 

Proposition :  The  closest  ordinary  logistic  model  to  any  serial  dependence 
model  is  the  one  with  the  same  coefficients,  if  distance  is  measured  by 
the  Kullback-Le ibler  distance  using  the  serial  dependence  model  as  the 
true  model. 

Let  X  -  log  f,  and  suppose  the  coefficients  in  the  two  models  are  the 
same.  The  value  of  the  Kullback-Leibler  distance  between  the  two  models 
is  well  approximated  by  a  fourth  order  Taylor  series  around  X=0.  If  X-0, 
then  because  the  two  models  coincide.  From  the  proof  of  the  propo¬ 

sition,  dKfg/dX  =  0  at  X=0.  Taking  derivatives  in  [3.1]  and  collecting 
terms  with  logarithms  gives 


Equation  [1.2]  or  [2.6]  defines  a  in  terms  of  y  and  the  marginal  proba 
bilities.  The  first  derivative  of  at  with  respect  to  f  is 


da 


__l  =  (pt~at)<pt-rat) 

df  l+(f-l)(pt-ptl-2at)  ' 


[: 


Higher  derivatives  can  be  obtained  from  this  expression.  At  X=0,  the 
derivatives  simplify  to 

da/df  =  Pt(l-Pt>  P^jd-P^j) 


d2a/d^2  =  -2  (da/ df)  [p^l-p^j)  +  pt-l^~Pt^ 

d3a/df3  =  6(da/df)  Ip^<l-Pt_1)2+Pt_i(1_Pt)2+3Pt(1_Pt)Pt_1(1~Pt_1> 


Therefore  the  Taylor  series  up  to  X*  gives 


IfgU)  *  (1/2)  X2  l  Ptfl-P^Pt-iU-Pt-j) 

+  (1/3)  X3  l  Pt(l-pt)(l-2pt)pt_1(l-pt_1)(l-2pt_1) 

+  (1/8)  X4  l  pt(l-pt)pt_1(l-pt_1)  [l-6pt(l-pt_1)-6pt_1(l-pt) 

+6p2(l-pt_i)2+6p2_i(l-pt)2+l8pt(l-pt)pt_i(l-pt_i)J  , 

so  it  follows  that 

Kfg(X)  =  (1/2)  X2  l  pt(l-pt)pt_1(l-pt_1) 

[l+(2/3)X(l-2pt)(l-2pt_1)+0(X2)] .  [3.3] 

This  shows  that  given  a  sequence  of  {ptJ  such  that  {pt~l/2}  and  (pt_j-l/2] 
tend  to  have  the  same  sign,  the  logistic  model  is  closer  to  models  with 
negative  X  than  to  those  with  positive  X.  The  converse  is  true  if 
(pt-l/2]  tends  to  oscillate  in  sign. 

Figure  3.1  shows  the  exact  Kullback-Leibler  distance  between  the  two 
models  for  one  example.  Here  I  generated  100  independent  normal  random 
variables  (Xt)  and  used  an  intercept  and  slope  both  equal  to  1.0.  I 
calculated  the  Kullback-Leibler  distance  for  various  values  of  the  odds 
ratio  equally  spaced  on  the  log  scale  between  0.1  and  10.  The  curves  are 
values  given  by  the  first  one,  two,  and  three  non-zero  terms  in  the  Taylor 
series  in  X.  The  series  up  to  X4  gives  a  good  fit  to  the  exact  distances 
except  for  very  low  values  of  the  odds  ratio,  where  the  two  models  are 
more  distant  than  the  approximation  suggests. 

To  examine  the  coefficient  estimates  using  the  logistic  model  with  data 
generated  under  the  serial  dependence  model,  I  performed  a  simulation. 

The  conditions  were  identical  to  those  described  in  the  previous  para- 


ure  3.1.  Observed  Rejection  Probability 
for  Various  Values  of  the  Odds  Ratio 


Odds  ratio 


graph.  The  results  are  shown  in  Table  3.1. 

It  appears  that  each  of  the  quantities  in  the  table  is  an  increasing  func¬ 
tion  of  the  odds  ratio.  The  quantities  associated  with  the  intercept, 
however,  increase  such  faster.  The  sample  bias  and  standard  deviation 
both  double  as  the  odds  ratio  moves  from  0.1  to  10.  For  the  slope,  the 
bias  increases  by  a  substantial  proportion  but  the  standard  deviation  is 
more  nearly  constant. 

The  bias  is  too  small  a  fraction  of  its  estimated  standard  error  to  con¬ 
firm  the  effect  with  a  hypothesis  test.  For  example,  the  difference  be¬ 
tween  the  bias  at  f=10  and  that  at  f=0.1  is  about  equal  to  its  estimated 
standard  error.  However  the  obvious  trend  indicates  that  the  observed 
difference  is  not  an  artifact  of  the  simulation. 

3 .2  Standard  Errors 

The  ordinary  logistic  coefficient  estimates  differ  from  the  true  coeffi¬ 
cient  values  only  by  a  small  fraction  of  their  standard  deviations.  How¬ 
ever  Table  3.1  shows  that  the  standard  deviations  of  the  intercept  esti¬ 
mates  change  by  a  factor  of  two  as  the  odds  ratio  changes  from  0.1  to  10. 
This  is  an  indication  that  the  standard  errors  produced  by  the  logistic 
model  cannot  be  correct.  Therefore  inferences  about  the  coefficients  are 
suspect . 

This  fact  is  more  apparent  in  Figure  3.2.  This  shows  the  result  of  a 
second  simulation,  with  1000  observations  at  each  value  of  the  odds  ratio 


Table  3.1.  Ordinary  Logistic  Estimates 

Observed  bias  and  standard  deviation  from  a  sample  of  400  ordinary  logis 
tic  estimates  for  each  value  of  the  odds  ratio.  Each  regression  has  100 


observations  and  a  single  explanatory  variable.  The  true  intercept  and 
slope  values  are  both  1.0. 


Odds  - Intercept - 

Ratio  Bias  Std  Dev 


Slope 


Std  Dev 


Bias 


e  3.2.  Observed  Rejection  Probability 
for  Test  of  True  Intercept  Value 


Odds  ratio 


bat  otherwise  with  conditions  identical  to  those  in  the  previous  simula¬ 
tion.  In  each  sample  I  counted  the  number  of  'misses, "  or  the  number  of 
times  the  true  value  of  the  intercept  was  not  within  1.96  standard  errors 
of  the  estimated  value,  with  both  estimates  and  standard  errors  from  the 
ordinary  logistic  model.  This  is  a  test  of  size  0.05  using  the  asymptotic 
normal  distribution  of  the  estimates. 

a 

For  each  sample  the  proportion  0  of  misses  is  an  estimate  of  the  size  6  of 
the  test.  Since  the  number  of  misses  follows  a  binomial  distribution,  a 
95  percent  confidence  interval  for  the  size  is  fl  -  1.96(0(l-0)/lOOO)*^. 
The  Figure  shows  this  confidence  interval  for  each  sample,  along  with  a 
line  marking  the  nominal  0.05  level. 

Clearly  this  test  does  not  have  the  proper  size  if  the  odds  ratio  is  not 
equal  to  1.  For  smaller  odds  ratios  the  estimated  standard  errors  in  the 
logistic  model  are  too  large,  so  the  intercept  estimates  are  more  accurate 
than  they  appear.  For  larger  odds  ratios  the  situation  is  worse.  Confi¬ 
dence  intervals  computed  using  the  ordinary  logistic  model  are  too  opti¬ 
mistic  i  their  coverage  probabilties  are  much  smaller  than  their  nominal 
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Chapter  4 


Maximum  Likelihood  Estimation 


In  this  Chapter  I  write  the  likelihood  function  and  its  derivatives.  I 
give  conditions  that  imply  consistency  of  the  maximum  likelihood  esti¬ 
mates,  and  I  describe  a  method  for  finding  these  estimates.  I  also  com¬ 
pare  the  maximum  likelihood  estimator  with  some  other  estimators. 


4.1  Solution  of  the  likelihood  equations 

If  the  likelihood  equation  is  written  as  the  product  of  conditional  like¬ 
lihoods,  its  logarithm  can  be  written 


v  °t 

l  =  log  Pj  +  d-yj)  logd^)  +  2  p — 


pt"at 


+  yt(1_yt-l)  log  TT~  +  <1_yt)yt-ll0g  — 


pt-rat 


+  (l-yt) (l-yt_1>  log 


1+VPt‘Pt-l 
^  ’ 


[4.1] 


where  at  =  Prob[Yt=Yt_j=l] .  This  quantity  is  defined  by  equation  [2.6], 
and  its  derivatives  are 

da  (pt~at)(pt-rat) 

df  1  —  ( f>- 1  >  (2ot~pt-pt_1 ) 

9a  _  Pt(1~Pt)ltI<>pt-r(<,"1)at1  4  pt-l(1~pt-l),t-lI^pt~(f~1)°t1 
dp  l-<f*-l )  (2at-Pt-Pt-i) 

The  derivatives  of  the  log  likelihood  are  most  easily  expressed  in  terms 


~pt  xt  |  pt~°t  J 

>,_1<1*V  a-yt)  <!-,,_!>  | 
+  ’.-.“-’t-i'Vi  [  lt‘rat  -  j 

n-1 

- }  pt<n>t>it  • 


Here  again  the  terms  in  brackets  have  zero  expectation,  so  this  simplifies 
the  expresssion  for  the  information  matrix  somewhat. 


pjl-pj(l-p 


<Pt-at) <1+«t_Pt"Pt_1)  (Pt_1~at) (l+ot-pt-p 

Pt<1-»t'pt-l<*-p,-i)  . 

— lta.-prpt~ — 

n-1 

-  }  pto-Pt)  . 


4.2  Consistency  of  the  maximum  likelihood  estimates 


Most  proofs  of  the  consistency  and  asymptotic  normality  of  maximum  likeli 
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hood  estimators  involve  assumptions  about  and  manipulation  of  the  Fisher 
information  matrix.  This  is  difficult  here  because  the  information  is  un¬ 
wieldy.  However  a  proof  of  consistency  is  possible  under  some  conditions. 

Vald  (1949)  proved  the  consistency  of  the  maximum  likelihood  estimator  for 
independent  and  identically  distributed  random  variables  under  certain 
regularity  conditions.  A  simplified  version  was  given  by  Chernoff  (1972). 
These  proofs  did  not  make  use  of  the  derivatives  of  the  likelihood  func¬ 
tion.  A  similar  proof  can  be  used  here  by  noting  that  the  log  of  the 
likelihood  ratio  (of  the  likelihood  at  an  alternative  parameter  value  to 
the  likelihood  at  the  true  parameter  value)  is  a  supermartingale,  and  by 
applying  a  strong  law  of  large  nus&ers. 

The  strong  law  requires  some  degree  of  independence.  As  was  proved  in 
Chapter  2,  if  (Yt)  is  generated  by  the  serial  dependence  model  then  it  is 
a  *-mixing  sequence.  It  is  easy  to  see  that  if  (YtJ  is  also  a  Markov  pro¬ 
cess,  then  f (Yt, . . . ,Yt+g)  is  a  *-mixing  sequence  for  any  function  f  and 
integer  s.  Strong  laws  are  available  for  such  sequences. 

To  prove  consistency  it  is  necessary  to  be  able  to  distinguish  the  true 
parameter  value  ®Q=(log  9q,0q)  *ro“  *nY  alternative  value  0=(log 
(Values  below  calculated  at  0=9q  are  given  the  subscript  0.)  To  do  so  it 
is  necessary  that  the  conditional  probabilities  n^Q  **  (ntQ(0) ,ntQ(l) )  = 
(P[Yt=l lYt_i“O,0Q] ,P[Yt«=l lYt_j=l,00] )  be  ~casionally  different  from  the 
probabilities  n(  *  (nt(0) ,nt(l) )  computed  under  the  alternative  6.  Refer¬ 
ence  to  Figure  1.1  indicates  that  these  conditional  probabilities  coincide 


if  f=f0  and  if  (Pt-l,0'PtO'Pt-l*pt) '  where 
log  [pt/(l-pt)]  =  X£P  . 
falls  on  a  set  Sp=Sp(0g,0)  for  which 

Pt-1,0  "tO*1*  +  (1_Pt-l,0)  "t0(0)  =  pt0 
pt-l  "t0(1)  +  (1"Pt-l)  nt0(0)  =  Pt 
nto(1)(1-nto(0)) 


(i-"to(1»”to<0)  ‘  ,0 


for  some  n^g, 


The  set  Sp  corresponds  to  a  set  Sz=Sz(0g,0)  in  the  space.  As 

long  as  a  substantial  proportion  of  the  pairs  (Xt_j,Xt)  are  removed  from 
Sz,  one  may  expect  to  be  able  to  discriminate  between  0g  and  0. 

Consistency  follows  from  the  following  assumptions. 

[Al]  Bounded  covariates:  there  exists  a  positive  H  such  that  for  all  t, 

IXtl<M. 

This  assumption  implies  the  existence  of  a  positive  p*  such  that  for  all 
t,  p*  <  min(ptg,l-ptg) .  This  in  turn  implies  the  existence  of  pnin  such 
that  0<Pmin-P*  »ad  fo*  »11  t,  PBin  <  min(ntg(l) ,l-ntg(l) ,ntg(0) ,l-ntg(0) ) 

[A2]  Identifiability:  given  P^Pg,  there  exist  ese<Pg,p)>0,  ^^(Pg.p)# 
and  T=T(Pg,p)  such  that 

n  <  n"1  #lt:  2<t<n  and  d( <X*_, ,X*) ,S_)  >  s),  [4.8 
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#k  is  the  nuaber  of  elements  in  the  set  A, 

d(z,S)  is  the  ainiaua  distance  between  z  and  S,  and 

Sx=Sx(0Q,0),  with  0=(log  fQ,p)‘ 

This  assumption  states  that  a  non-negligible  proportion  of  the 
pairs  do  not  lie  arbitrarily  close  to  the  manifold  on  which  5fc*2t0*  If 
appears  awkward,  bnt  following  the  proof  1  will  expand  upon  this  condition 
and  give  other  conditions  that  imply  it. 

[A3]  Compact  parameter  space :  the  true  values  of  the  parameters  can  be 
assumed  to  lie  in  a  known  compact  region. 

With  this  assumption  we  can  consider  the  restricted  maximum  likelihood 
estimator  subject  to  membership  in  the  compact  set. 

The  first  two  assumptions  imply  the  following  two  lemmas.  The  first  shows 
that  a  non-negligible  proportion  of  the  nt's  are  bounded  away  from  ntg. 

The  second  shows  that  a  non-negligible  proportion  of  the  log  likelihood 
ratios  are  bounded  below  zero. 

Lemma  1_.  If  Assumptions  [Al]  and  [A2]  are  satisfied  and  if  0=(log  f>Q,0), 
then  there  exist  *nd  T=T(0q,0)  such  that  for 

all  n>T. 

t|  <  n-1  # { t :  2itin  and  d(nt,5tQ)2s} .  [4.9] 

Proof:  The  set  Sx  is  the  set  of  for  which  Let 
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De  *  {<Xt-l*Xt):  IXJ1*'  i((Xt-1.It).8t)>.).  14.10] 

Since  d(nt#ntQ)  is  a  continuous  function  that  ia  positive  on  the  compact 
set  De<  it  achieves  some  positive  minimum  value  ejCe.Og.O)  on  it.  Assump¬ 
tion  [A2]  therefore  establishes  the  lemma. 

Lemma  2.  If  Assuaq>tion  [Al]  and  the  conclusion  of  Lemma  1  hold,  there 
exist  t|*i)(9o,6)>0,  e2=e2(0Q,0)  and  T=T(0q,0)  such  that  for  all  n>T, 

H  <  n-*  #{t:  2itin  and 

E[log  fjdjlT^j.ej-log  ft(TtlTt_1(«0)]  <  -e2).  14.11] 


Theorem.  Let  ©g  =  (log  {>g,pg)  be  the  true  value  of  ©  aad  let  C  be  any 
coapact  subset  of  the  parameter  space  that  does  not  contain  ©g.  Then 
assumptions  [Al]  and  [A2]  imply 


lim 

n->® 


sup 

©eC 


f,(YjY ,,•)  •••  f  (Y  |Y 
2  2  1  n  n 

T^Y-lY.,©.)  •••  f  (Y  IY 
2  2  10  n  n 


n-1 

n-1 


=  0 


w.p. 


1. 


Proof :  Note  that  the  above  event  is  equivalent  to 
n 

lim  sup  ^  log  ft(YtlYt_re)  -  log  f (Yt lY  ^.©g)  -  — .  [4.1 

n->  “>  ©eC  t_2 

I  will  show  that  this  equality  holds  almost  surely. 


Pick  any  6  in  C.  Define 

*  *  *  1  l©'~e|<p  ‘  1  t_1 


and 

ut(e,P)  =  log  ft(YtlYt_1,©,P)  -  log  ft(YtlYt_1,e0). 

Dt(©,p)  is  an  upper  bound  on  the  log  of  the  likelihood  ratio  in  a  neigh¬ 
borhood  of  ©.  More  specifically,  for  all  ©'e{©':  l©-©'l<p},  Ut(0,p)  > 
log  ft(Yt|yt-l*e')_1°*  ft(YtlYt_1,©0). 


There  are  two  cases:  either  ftf0,  or  f=g>0  but  In  the  second  case 

lemmas  1  and  2  hold.  If  f^fg,  then  there  is  some  positive  6  larger  than 
the  minimum  distance  between  the  curve 

{n=(n(0) ,n(l ) ) :  g>=n(l) (l-n(O) )/n(0) (l-n(l) ) , 

Pmin~n  ^  ® ^ ~*”Pmin  ^ 


and  the  curve 


{n=(n(0),n(l)) :  f0=n(l) (l-it(O) )/n(0) (l-n(l) ) , 

P.in-n(0)-1'P»in1- 

This  is  aost  easily  seen  by  ezaaining  Figure  1.1,  and  is  proved  by  noting 
that  these  cnrves  are  disjoint  coapsct  sets.  Then  for  all  t,  d(nt,nto)2ej 
for  soae  e2.  But  this  implies  the  conclusion  of  Leaaa  1,  so  Leaaa  2  holds 
for  this  case  as  veil. 

Pick  any  positive  e  small  enough  so  that  leaaa  2  applies,  and  let  i|  and  T 
be  as  in  that  leaaa.  Let  I  be  the  set  of  tiae  indices  such  that 
Ellog  ft(YtlYt_1#e)  -  log  ft(YtlYt_1,80)]  <  -e2. 

By  continuity  of  the  function  E[log  ft(YtlYt_2,0')]  on  the  compact  set 
{(0*,Xt_j,Xt):  0'eC,  lxt_j IlM,  |XtliM},  and  therefore  by  uniform  continu¬ 
ity  on  that  set,  there  exists  a  positive  p2sP2(0o>°)  such  that  for  all 
tel,  E[Ot(0.p)]<-e2/2  if  p<p2. 

Since  E[log  *t<Yt lYt-l '0)  -  log  ft(YtlYt_1,©0)] <0  for  all  t,  if  follows, 
again  by  uniform  continuity,  that  there  exists  a  positive  P2=p2(0o»0)  such 
that  for  all  til,  E[Ut(0,p)]  i  Tje2/4(1— n)  if  P<P2.  Hence  for  n^T  and 
P0iain(p2(0Q,0) ,p2(0q,0) ) • 

V  ✓  *2  T>8 2 

2  E[Ot(0,p)]  i  nij  +  n(l-ti)  *  -ni|e2/4  , 

t-2 

which  approaches  -•  as  n  ->  «. 

By  compactness,  there  is  a  finite  covering  of  C  by  sets  {0:  I ® — 0 j I <P© j ) , 
j=l,...,k.  For  every  0eC  there  is  a  jik  such  that 

I  Ot(0j.pej)  >  Z  log  ft(Yt |Yt_1#0)  -  log  ft(Y^|Yt_2,@0) , 


so  {0t(©j ,pgj ) } ,  j=l,...,k,  fora  a  finite  collection  of  random  sequences 
that  bound  all  the  log  likelihood  ratios  for  8eC.  Therefore  to  prove  the 
theorem  it  is  sufficient  to  prove  that  for  every  j*  2  Ut(6j,pQj)  al- 

aost  surely. 

Because  (Yt)  is  a  ^-mixing  process,  {D^}  is  also  a  *-mixing  process.  Be¬ 
cause  it  is  a  continuous  function  on  a  compact  set,  it  is  bounded,  so 
aa  n 

^  n“2  E[  (U  -ED  )2]  <  ®  and  sup  n"1  ^  E  lU  -EU  I  <  ®  . 

L  n  n  n  Z  t  t 

n=l  t=l 

Therefore  by  Theorem  2.20  of  Hall  and  Heyde  (1980),  n-*  £  [Ut-EUt]  ->0 

almost  surely.  But  I  EUt  — 9— ®,  so  the  theorem  is  proved. 

Assumption  2  above  requires  that  a  non-negligible  proportion  of  (Xt_j,Xt) 
be  at  least  some  minimal  distance  from  each  of  a  certain  family  of  mani¬ 
folds.  Unfortunately  a  smooth  manifold  could  be  put  through  any  finite 
collection  of  points,  so  this  condition  is  hard  to  check.  A  closer  exami¬ 
nation  of  the  manifolds  may  be  useful. 

Let  t>=y>0'  the  true  value  of  the  odds  ratio.  The  assumption  requires  that 
if  Pq  is  the  true  value  of  the  coefficient  vector,  for  any  other  vector  p 
a  non-negligible  proportion  of  the  time  (Xt_2»Xt)  lie  at  least  some 
minimal  distance  from  the  manifold  on  which  Given  X^-^'Pq  and 

Xt'PQ,  the  marginal  probabilities  P^-1,0  and  Pt,0  are  determined,  and 
therefore  is  determined.  But  reference  to  Figure  1.1  clearly 
indicates  that  for  any  value  of  p£_i  there  is  only  one  value  of  pt  that 
produces  a  given  n^.  Therefore  given  Xt_^'P,  Xt'P  is  determined. 


The  same  statments  apply  to  the  other  linear  combinations,  so  any  three  of 
the  quantities  Xt'0Q,  Xg.^'f),  Xt'0)  determine  the  other.  There¬ 

fore  each  of  these  qnantities  is  a  single-valned  function  of  the  other 
three.  There  are  two  conditions  under  which  points  cannot  be  restricted 
to  be  on  this  manifold: 

1.  (Xt_ltXt)  is  a  random  variable  with  a  non-degenerate  distribu¬ 
tion,  so  it  is  not  concentrated  along  a  lower  dimensional  mani¬ 
fold  with  probability  one. 

2.  There  are  integers  t  and  s  such  that  Xt=Xg  but  either  ^t-l^s-1 
or  Xt+1«,+1. 

The  second  condition  is  likely  to  hold  if  the  X's  can  take  only  a  finite 
collection  of  values,  or  if  they  are  the  result  of  an  experimental  design. 
The  first  condition  is  likely  to  hold  under  a  variety  of  circumstances. 

The  assumption  is  somewhat  stronger;  it  requires  that  a  non-negligible 
proportion  of  the  (Xt_^,Xt)  be  bounded  away  from  the  manifold.  This  con¬ 
dition  would  likely  be  satisfied  if  the  X's  take  finitely  many  values  or 
come  from  an  experimental  design.  It  would  also  hold  if  the  (Xt)  were 
independent  and  identically  distributed  with  a  non-degenerate  distribu¬ 
tion,  and  would  probably  hold  under  milder  conditions  as  long  as  the 
dependence  is  not  too  great  and  the  distributions  do  not  converge  to  a 
degenerate  distribution. 
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4.3  Comparison  of  Estimators 

Using  the  expression  for  the  first  derivatives  and  the  information  matrix 
from  section  1  of  this  chapter,  it  is  not  difficult  to  find  the  maximum 
likelihood  estimates  of  f  and  0  by  Fisher's  scoring  method.  However  the 
results  in  the  previous  chapter  suggest  the  ordinary  logistic  estimates  of 
0  may  be  adequate.  In  this  section  I  perform  a  simulation  to  compare 
estimators  of  0  and  f. 

In  this  simulation  I  generated  samples  of  100  observations  Yt  for  which 
the  marginal  probabilities  pt  satisfied  log  (pt/(l-pt))  =  1  +  xt,  where 
the  {x^}  were  independent  standard  normal  random  variables.  I  generated 
200  such  samples  for  each  of  21  values  of  f>  equally  spaced  on  the  log 
scale  between  0.1  and  10. 

For  each  sample  I  estimated  the  coefficients  both  by  ordinary  logistic  re¬ 
gression  and  by  maximum  likelihood  estimation  for  the  serial  dependence 
model.  I  estimated  the  log  odds  ratio  by  three  methods: 

1.  Unrestricted  maximum  likelihood  estimation  for  the  serial  depen¬ 
dence  model  (UMLE) . 

2.  One  iteration  of  the  scoring  method,  starting  with  the  ordinary 
logistic  coefficient  estimates  and  with  f=l  (1STEP).  (This  is 
equivalent  to  setting  the  score  statistic,  defined  in  Chapter  5, 
equal  to  its  expected  value  and  solving  for  |>.) 

3.  Restricted  maximum  likelihood  estimation,  with  the  coefficients 
constrained  to  their  ordinary  logistic  estimates  (RMLE) . 


The  simulation  results  are  summarized  in  Tables  4.1-2  and  Figures  4.1-3. 


Table  4.1  shows  the  correlation  between  those  estimators  that  estimate  the 
same  quantity.  In  each  case  the  correlation  seems  to  vary  with  the  odds 
ratio,  with  lower  correlations  for  odds  ratios  far  from  1.  Most  correla¬ 
tions  exceed  0.9  when  the  odds  ratio  is  not  far  from  1.  The  correlation 
between  RMLE  and  UMLE  is  quite  near  1  for  a  wide  range  of  odds  ratios,  so 
this  is  an  indication  that  if  the  odds  ratio  is  not  expected  to  be  extreme 
a  priori,  the  restricted  maximum  likelihood  estimator,  which  is  simpler  to 
compute,  may  be  as  good  as  the  unrestricted  maximum  likelihood  estimator. 

If  the  odds  ratio  is  known  a  priori  to  be  neither  zero  nor  infinity,  the 
1STEP  estimator  has  an  additional  advantage:  it  always  gives  a  finite  es¬ 
timate  of  the  odds  ratio.  Perfect  association  occurred  in  many  samples, 
so  the  RMLE  and  UMLE  estimates  of  the  log  odds  ratio  are  infinite. 

Table  4.2  shows  the  observed  bias  and  standard  deviation  for  the  coeffi¬ 
cient  estimates.  As  was  shown  in  the  previous  chapter,  the  standard  devi¬ 
ations  of  the  intercept  increase  with  the  odds  ratio,  and  this  phenomenon 
is  apparent  here  as  well.  The  bias  of  the  slope  estimates  is  smaller  for 
most  of  the  maximum  likelihood  estimates,  but  there  is  no  noticeable  pat¬ 
tern  to  the  other  quantities.  Because  the  biases  are  small  in  comparison 
with  the  standard  deviations,  a  much  larger  sample  would  be  required  to 
test  for  a  significant  difference  between  the  logistic  and  maximum  like¬ 


lihood  biases. 


I 


T 
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Table  4.1.  Correlations  of  Estimates 
For  each  value  of  the  odds  ratio,  the  table  gives  the  correlation  between 
the  given  pair  of  estimates  in  a  sample  of  size  200.  The  last  column 
gives  the  number  of  observations  with  perfect  association)  for  these  sam¬ 
ples  the  DMLE  and  RMLE  values  are  zero  or  infinity.  These  observations 
were  excluded  in  computing  the  correlations. 


Odds 

Ratio 


Constant  Slope 


1STEP/ 

RMT.R 


1STEP/  RMLE/ 

DMLE  DMLE 


Perfect 
Assoc iation 
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Table  4.2.  Observed  Bias  and  Standard  Deviation 
The  elements  in  the  table  are  the  sample  bias  and  standard  deviation  of 
the  coefficient  estimates  in  the  simulation.  Observations  with  perfect 
association  are  omitted. 


-  Constant  -  -  Slope  - 

Odds  Logistic  MLE  Logistic  MLE 


Ratio  Bias  Std  dev  Bias  Std  dev  Bias  Std  dev  Bias  Std  dev 


.10 

.0047 

.208 

.0024 

.201 

.0211 

.309 

.0153 

.273 

.13 

.180 

.0694 

.327 

.0550 

.289 

.16 

.0224 

.212 

.017  6 

.207 

.288 

.0125 

.275 

.20 

-.0088 

-.0121 

.198 

.0352 

.254 

.0249 

.265 

.25 

.0409 

.247 

.0414 

.253 

.0793 

.0680 

.320 

.32 

.0476 

.235 

.0492 

.242 

.0740 

.299 

.0638 

.316 

.40 

.0185 

.211 

.0169 

.212 

.0378 

.355 

.0298 

.339 

.50 

.0314 

.227 

.0299 

.224 

.0471 

.291 

.0394 

.296 

.63 

.0290 

.232 

.0295 

.236 

.0345 

.347 

.0363 

.342 

.79 

.0369 

.274 

.0371 

.276 

.0360 

.286 

.0381 

.290 

1.00 

.0451 

.285 

.0471 

.288 

.0605 

.294 

.0681 

.307 

1.26 

.0323 

.281 

.0297 

.281 

.0523 

.323 

.0504 

.330 

1.58 

.0312 

.306 

.0289 

.306 

.0077 

.294 

.0827 

.310 

.0842 

.311 

.0676 

.0676 

.319 

2.51 

.0547 

.299 

.0525 

.0234 

.276 

.0218 

.276 

3.16 

.0213 

.0155 

.308 

.0452 

.302 

.0355 

.307 

3.98 

.0687 

.317 

.0657 

.319 

.0911 

.0877 

.297 

5.01 

.0649 

.378 

.0574 

.373 

.0474 

.317 

.0330 

.285 

6.31 

.0672 

.426 

.431 

.0617 

.330 

.0675 

.308 

7.94 

.0150 

.327 

.0139 

.322 

.0408 

.237 

.0388 

.230 

.0416 

.395 

.0351 

.383 

.283 

.0546 

.265 

Figure  4.1.  Stem  and  Leaf  Display  of  Log  Odds  Ratio  Estimates 
True  Log  Odds  Ratio  is  0.0 
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Figure  4.2.  Stem  and  I  .af  Display  of  Log  Odds  Ratio  Estimators 
True  Log  Odds  Ratio  is  0.69 
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Unrestricted  MLE.  Values  off  plot:  -1.0  2.3  4.7 
2  -0.8 

5  -0.766 

9  -0.4444 

17  -0.33333222 

28  -0.11111110000 
45  +0.00000000111111111 

69  +0.222222222233333333333333 

87  +0.444444444455555555 

(38)  +0.66666666666666666677777777777777777777 
75  +0.888888888888889999999999 

51  1.00000000111111111 

34  1.222222223333 

22  1.445 

19  1.666667777 

10  1.889999 

4  2.00 


Figure  4.3.  Stem  and  Leaf  Display  of  Log  Odds  Ratio  Estimators 
True  Log  Odds  Ratio  is  2.30 


One-Step  Estimate.  Values  off  plot:  0.1  0.2  3.6 
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Restricted  MLE.  Values  off  plot:  0.1  0.1  4.3 
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Unrestricted  MLE. 


Values  off  plot:  5.5  5.5  5.6  5.6  5.8 
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Figures  4.1-3  show  stem  and  leaf  plots  of  the  log  odds  ratio  estimators 
for  three  values  of  the  odds  ratio:  1,  2.  and  10.  In  Figure  4.1,  with 
9=1,  there  seems  to  be  little  difference  between  the  estimators  based  on 
this  display,  except  the  number  of  extreme  values  (*values  off  plot*  in 
the  figures)  is  smallest  for  1STEP  and  largest  for  UMLE.  The  medians  of 
the  estimators  are  about  the  same,  and  all  are  close  to  the  true  value. 

In  Figure  4.2  the  situation  is  little  different.  There  is  a  very  large 
value  of  UMLE,  but  otherwise  the  shapes  of  the  distributions  appear  to  be 
very  similar.  Again  the  medians  are  about  the  same,  and  all  are  close  to 
the  true  value. 

Figure  4.3  shows  a  more  dramatic  difference.  The  estimator  1STEP  retains 
the  bell  shape  it  assumed  in  the  other  figures,  but  RMLE  has  a  noticeably 
heavier  upper  tail.  The  shape  of  UMLE  is  even  more  shewed,  with  eleven 
extreme  values  off  the  high  end  of  the  plot. 

The  median  of  1STEP  seems  to  be  a  little  smaller  than  the  true  value, 
while  UMLE  is  a  little  larger.  However  for  five  observations  (not  in¬ 
cluded  in  the  plot)  perfect  association  occurred,  so  RMLE  and  UMLE  both 
were  infinite. 

In  summary,  this  relatively  small  simulation  does  not  show  any  advantage 
to  using  maximum  likelihood  estimation  in  the  serial  dependence  model 
rather  than  ordinary  logistic  coefficient  estimates.  Using  these  esti¬ 
mates  and  ^>*1  as  starting  values,  performing  a  single  iteration  of  the 
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scoring  method  gives  a  good  estimate  of  the  log  odds  ratio.  When  f  is 
near  1  this  estimator  is  not  noticeably  different  from  the  others,  bnt 
when  is  extreme  it  avoids  the  problem  of  perfect  correlation. 

If  the  standard  errors  of  the  estimates  are  also  of  interest,  as  is 
usually  the  case,  then  the  maximum  likelihood  estimate  remains  important. 
The  standard  errors  produced  by  this  model  differ  from  those  produced  by 
the  ordinary  logistic  model,  and  are  closer  to  reality. 


Chapter  5 


A  Test  for  Independence 


5.1  Introduction 

Compared  with  ordinary  logistic  regression,  maximum  likelihood  estimation 
of  this  model  takes  a  relatively  large  amount  of  computer  time  and  re¬ 
quires  specialized  software.  It  would  be  useful  therefore  to  be  able  to 
test  for  dependence  without  having  to  do  the  maximum  likelihood  computa¬ 
tions.  In  some  cases  it  may  not  be  thought  necessary  to  compute  the 
maximum  likelihood  estimates  if  a  preliminary  test  does  not  reject  the 
hypothesis  of  independence. 

Such  a  test  can  be  based  on  the  score  statistic,  which  requires  maximiza¬ 
tion  only  over  the  subset  of  the  parameter  space  that  satisfies  the  null 
hypothesis.  Since  independence  implies  f=l,  testing  for  independence  re¬ 
quires  computing  the  ordinary  logistic  estimates,  as  they  maximize  the 
likelihood  subject  to  this  restriction.  As  a  result  the  score  test  can  be 
performed  with  little  more  computational  effort  than  that  required  by 
logistic  regression. 

In  ordinary  linear  regression,  a  test  for  serial  correlation  can  be  based 
on  the  sample  autocorrelation  function  of  the  residuals,  or  equivalently 
on  the  Durbin-Vatson  statistic.  A  test  based  on  the  latter  was  developed 
in  a  series  of  papers  by  Durbin  and  Watson  (1950,  1951,  1971).  As  I  will 
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show  below,  the  score  test  for  this  model  is  bssed  on  the  sample  autoco- 
▼ariance  of  the  data. 

The  score  test  is  described  by  Rao  (1973)  and  by  Cox  and  Hinkley  (1974). 
Let  U  be  the  score  function  and  i  the  information  matrix.  To  test  a  por¬ 
tion  f  of  a  parameter  vector  (f,£),  Cox  and  Binkley  define  the  score 
statistic  W  by 

W  =  D'  iW  C.  ,  [5.1] 

f  f 

where  is  the  portion  of  the  score  function  corresponding  to  j>  and  iW 
is  the  corresponding  submatrix  of  the  inverse  of  the  information  matrix. 
When  f  is  a  scalar  it  is  more  convenient  to  retain  the  sign  by  using 
instead 

f1/2  «=  0  (i¥¥)1/2.  [5.2] 

T 

I  derive  the  score  statistic  for  this  problem  in  section  2,  In  section  3 
I  give  the  asymptotic  distribution  of  W  under  the  null  hypothesis  and  ex¬ 
amine  the  empirical  distribution  in  finite  samples.  In  later  sections  I 
consider  the  distribution  under  alternative  hypotheses. 

5.2  Derivation 

The  score  function  is  defined  as  the  derivative  of  the  log  likelihood. 

Expressions  for  the  derivatives  appear  on  pages  41-42.  Since  this  is  a 

test  for  independence,  these  expressions  are  to  be  evaluated  at  f*T. 

Therefore  ot  “  Pt^t-1  *n<*  the  *core  functions  become 

n 

Of  ’  %  -  l  <VV  (It-rpt-i)  l5-3) 

t=2 


Note  that  since  the  unconditional  expectation  of  xt  is  pt,  the  score  fane 
tion  for  the  odds  ratio  has  the  form  of  the  autocovariance  of  the  {YtJ. 


The  statistic  also  involves  the  information  matrix.  The  three  components 
evaluated  at  ^>=1  are 


-*[0]  -  • 

-E[0]  ■ 

-E r-Ji-i  .  o. 

L  ap  0|>  J 

Therefore  the  complete  information  matrix  can  be  written 


=  iw  ivP  ] 
.  *PP  J 

■  »»(1-v  | 


Pt-i(1"pt-i)  0 


xtx't  J 


if  Pg  is  taken  to  be  eqaal  to  0. 


The  score  statistic  is  therefore 


„  « (W(Yt-r?t-i}1 
s  1 5td-Pt)  pt-i(1_pt-i)  ' 

where  =  1/ (l+expC-x^P) )  and  P  is  the  estimate  of  p  in  the  ordinary 
logistic  model.  Since  this  is  a  test  on  a  scalar  parameter,  the  other 
statistic  can  also  be  defined: 


[  I  (Y  -pJ(Y 


II  Ptd-Pt)  Pt-l(1_5t-l)] 


5.3  Distribution  of  the  score  statistic  under  the  null  hmothesis 


According  to  asymptotic  theory,  the  distribution  of  1  under  the  null  hy¬ 
pothesis  of  independence  should  be  central  chi-square  with  one  degree  of 
freedom,  and  the  asymptotic  distribution  of  should  be  standard  nor¬ 

mal.  To  examine  the  distribution  of  W  for  finite  samples  I  used  a  Simula 
tion.  Figure  5.1  shows  a  normal  probability  plot  of  a  sample  of 
generated  by  the  following  procedure: 


(1)  generate  X^,...,X^qq  independent  standard  normal  random 
variates 

(2)  generate  Y1,...,Y^qq  independent  Bernoulli  random  variates 
with  probability  pt  of  success  satisfying  log(pt/ (l-pt) )  = 


(3)  perform  ordinary  logistic  regression  and  compute 


I  generated  a  sample  of  400  score  statistics  by  this  procedure,  using  the 
same  (Xt)  each  time.  From  the  plot,  the  sample  seems  consistent  with  a 
standard  normal  distribution. 


Since  W  is  a  test  statistic,  its  upper  tail  behavior  is  of  interest.  For 
example,  (0.10) (400)  -  40  of  the  score  statistics  could  be  expected  to 
fall  above  the  0.90  point  of  the  chi-square  distribution,  or  2.706.  In 
this  sample  48  were  observed  above  the  critical  value.  If  the  distribu- 
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tion  is  correct,  the  number  of  significant  statistics  should  have  a  bino¬ 
mial  distribution  with  mean  40  and  standard  deviation  [(0.1) (0.9) (400)]*^ 
=6,  so  a  difference  of  8  from  the  expected  value  is  not  significant  at 
the  0.05  level. 

The  Kolmogorov-Smirnov  distance  between  the  empirical  cumulative  distribu- 
1/2 

tion  of  V  and  the  standard  normal  cumulative  distribution  function  is 
0.049,  a  value  that  is  below  the  0.95  point  of  the  distribution  of  the 
Kolmogorov-Smirnov  statistic,  or  1.36n-*^  =  0.068. 

5.4  Distribution  for  nearby  alternatives :  first  order  approximation 
To  get  some  idea  of  the  power  of  the  test,  it  would  be  useful  to  find  the 
distribution  of  the  score  statistic  when  the  odds  ratio  is  not  equal  to 
one.  It  is  simpler  to  work  directly  with  the  odds  ratio,  but  when  con¬ 
sidering  alternative  hypotheses  it  seems  more  natural  to  use  X=log  y  as 
the  parameter  measuring  dependence.  Unlike  f,  X  can  vary  in  either  direc¬ 
tion  without  limit.  As  will  be  seen  below,  the  effect  of  alternative 
values  of  X  on  the  distribution  of  the  score  statistic  depends  on  the 
magnitude  of  X  but  is  independent  of  its  sign,  or  nearly  so. 

The  asymptotic  distribution  of  a  score  statistic  under  alternative  hypoth¬ 
eses  is  given,  for  example,  by  Cox  and  Hinkley  (1974).  Suppose  the  score 
statistic  V  is  computed  for  a  sample  of  size  n.  With  the  null  hypothesis 
Hq:  X-0,  if  a  sequence  of  alternatives  Hft:  X=6r,  ;  ^  is  to  be  considered, 
then  the  asymptotic  distribution  of  V  is  non-central  chi-square  with  one 
degree  of  freedom  and  with  non-centrality  parameter  6*i^/n.  (This  is 
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true  here  because  the  information  matrix  is  block  diagonal.  In  general 

ij^  should  be  replaced  by  (i^M-*,  where  is  the  corresponding 

submatrix  of  the  inverse  of  i.)  Equivalently  W’  '  4  has  an  asymptotic 

1/2 

normal  distribution  with  mean  Mi^/n)  '  and  variance  1. 


As  a  basis  for  comparison,  I  generated  twenty  additional  samples  of  400 
score  statistics  as  above  for  various  values  of  the  odds  ratio.  I  used 
the  same  set  of  (Xt)  in  each  case.  I  calculated  the  observed  power  func¬ 
tion  in  each  case  as  the  proportion  of  W  values  larger  than  2.706,  the 
0.90  point  of  the  central  chi-square  distribution  with  one  degree  of  free¬ 
dom.  These  values  are  those  labeled  "observed  power"  in  Figure  5.2.  For 
each  value  0  of  the  observed  power,  I  calculated  a  95%  confidence  interval 
for  the  true  power  as  0  -  1.96(0(l-0)/4OO)^2,  and  these  confidence  inter¬ 
vals  also  appear  in  Figure  5.2. 


The  power  given  by  asymptotic  theory  is  simply  P[V>2.706],  where  V  has  a 
chi-square  distribution  with  one  degree  of  freedom  and  with  non-centrality 
parameter  *#iU  "  3*1  (log  f>)a.  In  Figure  5.3,  this  curve  is  superimposed 
on  the  simulation  results. 


Clearly  there  is  some  lack  of  fit,  since  eight  of  the  twenty-one  confi¬ 
dence  intervals  do  not  contain  the  value  on  the  curve.  Qualitatively, 
though,  the  curve  does  seem  to  predict  the  observed  power  pretty  well. 

For  0.3lyl2.5,  the  intervals  do  contain  the  curve.  It  is  not  surprising 
to  see  a  lack  of  fit  for  more  extreme  values,  since  the  curve  is  obtained 


Figure  5.2.  Power  as  a  Function  of  Odds  Ratio 


Odds  Ratio 


Function  of  Odds  Ratio 


Odds  Ratio 
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by  an  asymptotic  calculation  valid  for  nearby  alternatives. 

The  above  procedure  sheds  some  light  on  the  upper  tail  of  the  distribution 
of  the  score  statistic,  but  not  on  the  bulk  of  the  distribution  of  W.  I 
used  the  Kolmogorov-Smirnov  statistic  as  a  summary  of  the  difference  be¬ 
tween  the  cumulative  distribution  functions  of  the  empirical  and  asymp¬ 
totic  distributions,  and  the  results  appear  in  Table  5.1.  Column  1  con- 
tains  the  odds  ratio,  and  column  2  contains  n  '  -20  times  the  Kolmogorov- 
Smirnov  distance  D  between  the  two  distributions.  (The  other  columns  are 
explained  below.)  Values  larger  than  the  upper  95%  point  of  the  distribu¬ 
tion  of  n^^D,  or  1.36,  are  marked  with  an  asterisk.  The  results  here  are 
similar  to  those  above;  there  is  no  significant  lack  of  fit  for  .4^|>i2.5. 

The  following  sections  describe  two  attempts  to  improve  the  accuracy  of 
the  power  curve.  In  the  first  attempt  I  obtain  a  higher  order  approxima¬ 
tion  to  the  distribution  of  W  using  the  results  of  Harris  and  Peers 
(1980).  In  the  second  I  use  more  informal  techniques  to  examine  the 
deviation  of  the  simulated  W1^2  from  its  theoretical  distribution  and  I 
find  an  empirical  adjustment  that  improves  the  approximation  to  the 
observed  power. 

5.5  Distribution  for  nearby  alternatives:  higher  order  approximation 
One  approach  toward  improving  the  fit  to  the  observed  power  function  is  to 
find  a  higher  order  approximation  to  the  distribution  of  the  score  statis¬ 
tic.  Peers  (1971)  gave  such  an  approximation  for  simple  tests,  and  Harris 
and  Peers  (1980)  extended  the  results  to  composite  tests. 
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Table  S.l.  Normalized  Kolmogorov-Smirnov  Distances  Between  the 
Empirical  and  Fitted  Distributions 


For  each  value  of  the  odds  ratio  the  other  columns  give  n1'  -20  times  D, 
the  Kolmogorov-Smirnov  distance  between  the  empirical  distribution  of  the 
corresponding  sample  and  the  fitted  distribution.  Values  larger  than  the 
0.95  point  of  the  null  distribution  of  n  '  D,  or  1.36,  are  marked  with  an 


asterisk. 


Ratio 


First  Order 


Approximation 


Higher  Order 


Empirical 


Approximation  Approximation 


7.903* 

5.842* 

4.931* 

2.352* 

2.342* 

1.713* 


1.338 


1.255 


1.712* 

2.314* 

2.888* 

3.718* 

3.603* 

4.361* 


2.456* 

1.076 


1.609* 

1.350 


1.571* 

1.080 

1.331 

1.304 


1.341 

1.461* 

1.794* 

3.207* 

4.391* 

6.201* 

7.950* 

9.032* 

11.154* 


1.330 

1.017 

1.402* 


1.317 

1.205 


1.234 


1.012 

1.144 


1.309 

2.826* 
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Because  the  derivation  is  lengthy,  I  present  the  results  first. 


I 

t 


The  density  of  the  score  statistic  V  to  order  n can  be  written  as  a 
linear  combination  of  non-central  chi-square  densities,  each  with  the  same 
non-centrality  parameter  but  with  different  degrees  of  freedom.  More 
specifically  the  density  is 

3 

g  (w)  =  f(wjl,p)  +  n  ^  Cj  f  (wjl+2j  ,p)  +  0(n-1),  [5.11] 

j=0 

where  f(.ij,p)  is  the  density  of  a  chi-square  random  variable  with  j 
degrees  of  freedom  and  with  non-centrality  parameter  p.  I  will  give  the 
values  of  p  and  the  c,'s  below. 

•I 

Several  points  can  be  made  about  this  approximation.  First,  the  first 
term  on  the  right  hand  side  is  the  usual  (first  order)  approximation  to 
the  distribution  of  V  under  alternative  hypotheses.  Second,  under  the 
null  hypothesis  all  the  Cj's  are  zero,  so  both  approximations  lead  to  the 
same  distribution  in  this  case.  Third,  this  is  simply  an  approximation  to 
an  asymptotic  distribution,  and  it  will  not  necessarily  integrate  to  one 
in  finite  samples. 

Figure  5.4  contains  the  power  function  for  this  higher  order  approximation 
as  well  as  the  power  function  examined  earlier.  Any  improvement  here  is 
marginal.  The  new  curve  seems  to  give  a  better  fit  to  the  observed  power 
for  f<l,  but  it  gives  a  poorer  fit  for  f>l.  Because  it  is  not  a  true  den¬ 
sity  and  does  not  integrate  to  one,  it  gives  values  of  the  'power  func- 


Figure  5.4.  Power  as  a  Function  of  Odds  Ratio 
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tion"  that  are  larger  than  one  for  y^8.  Since  the  first  order  approxima¬ 
tion  has  the  advantage  of  simplicity,  there  seems  to  be  no  reason  to 
prefer  the  more  complex  approximation. 


The  third  column  of  Table  5.1  contains  the  normalized  Kolmogorov-Smirnov 
distances  between  the  empirical  distribution  of  the  sample  and  the  higher 
order  asymptotic  distribution.  It,  too,  shows  mixed  results,  with  an  im¬ 
proved  fit  for  y<l  but  a  poorer  fit  for  y>l. 

The  remainder  of  this  section  consists  of  calculations  of  the  quantities 
needed  to  apply  the  Harris  and  Peers  results  to  this  model. 


Though  it  is  more  natural  to  use  X  =  log  y  as  the  parameter  of  interest, 
it  is  more  convenient  to  work  with  y.  To  convert  the  results  for  use  with 
X  requires  use  of  the  chain  rule  to  derivatives  of  up  to  third  order.  For 


any  function  u(y). 
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Following  the  notation  of  Harris  and  Peers,  define 


1  8L 
~U2  96.  ' 

XL  1 


ij  n  ae.ae. 

i  j 


3/2  ae.ae.ae,. 

n  i  j  k 


kij  ’  EUijI-  li,j  ■  EI,iV'  li.j.k‘n  E[ui°j*k1- 


.  1/2  o i  i  V  1/2  «r  i 

ki,jk‘»  ElulV’  kljk*“  ^“ijk1  • 

where  L  is  the  log  likelihood.  All  expectations  are  taken  with  j»=l  and 
with  the  true  valne  of  0.  All  the  k's  are  0(1).  Relations  between  these 
quantities  are  given  by  Harris  and  Peers t  in  addition  to  the  familiar 
relationship 

ki.j  +  kij  =  °» 

there  is  a  relationship  between  the  remaining  expectations: 
kijk  +  ki,jk  +  kj,ik  +  kk, ij  +  ki, j ,k  =  °* 


Let  the  symbol  E  represent  the  matrix  of  k^j'si  it  is  simply  the  informa 
tion  matrix  calculated  earlier.  Let  subscripts  on  K  refer  to  the  corres¬ 
ponding  submatrix,  for  example  &22  i*  tke  submatrix  of  K  corresponding  to 
0.  Let  a  dot  subscript  refer  to  the  entire  dimension  of  E,  so  Ei.  = 

[Eh  k12^ *  Define  triply— subscripted  E's  similarly,  so  for  example  E2a<> 
is  a  three  dimensional  array  of  k^,^,  with  i>l. 


Let  e  =  n1'2  log  measure  the  distance  between  the  hypothesized  and  true 
values  of  X.  Define  the  following: 
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K  1  8  ' 

z2  *21  J 


0  0 

0  K22  .  ' 


K'1-  J 


bt(i-pt)pt_1d-pt_1) 


Then  the  probability  density  of  the  score  statistic  is  given  by  equation 
[5.11]  with  non-centrality  parameter 

p  ‘  An  =  x2/£pt(1-pt)pt-i(1-pt-i) 
and  with  coefficients 

°o "  f  [  ‘S.i.rW  *  3,3<kiii*ki.n>  -  3t«.  ,i+“.  ..i’*3  ] 

°1  T  [  *  <tlll“2kl.l,l>  "  3e  ^klll+kl,ll^  '  3,kl,l,l”p/x 

+  3e (K  .+2K  ,)*J  1 

•  el.  •!  il  J 


c2  =  y  8ki,i.inp/x2 


1  K 

*3  “  T  8  kl,l,l* 

The  notation  A*B  used  in  the  expressions  for  cq  and  c^  Beans  ^ij^ij^ij' 


I  present  expressions  for  the  reaaining  quantities  without  proof: 
*1,1.1  =  n_1  1  pt-l(1'pt-l)(1_2pt-l)  Pt(1-pt)(1-2pt) 
K1U  85  -3n_1  ^  Pt<1-PtHl-2pt) 


V- 1**". '**.-*. **  *  .'-  .’•  .■■’.''V-’/*"/-  /*•  ,**  ’•  *’•  V-  .*■  *'•  >  ^  wS  .*•  *’ 


c  f  s  r  n  o 

,2,2  *2,1,2  *2,2,1 


1  Pt(l~Pt>  xtXt-l 


Lack  of  fit  of  W  to  its  asymptotic  distribution  could  take  various 
forms j  it  could  have  a  normal  distribution  but  with  a  different  mean,  a 
normal  distribution  with  a  different  mean  and  variance,  or  a  distribution 
that  is  not  normal.  In  this  section  I  will  find  the  nature  of  the  lack  of 
fit.  I  will  apply  exploratory  techniques  rather  than  large-sample  theory 
to  model  the  lack  of  fit  and  improve  the  power  curve. 


First  suppose  the  distribution  is  normal,  but  the  parameter  values  are  not 
as  predicted  by  asymptotic  theory.  The  sample  means  and  standard  devia- 
tions  of  the  score  statistics  W  '  obtained  by  simulation  appear  in  Table 
5.2  and  in  Figures  5.5  and  5.6,  as  a  function  of  the  odds  ratio.  The 
dashed  lines  in  the  Figures  give  the  values  predicted  by  theory. 


The  sample  means  are  quite  near  the  lines  for  odds  ratios  near  1,  but  they 


are  smaller  in  absolute  value  for  more  exti 


lines  of  the  odds  ratio. 


The  sample  standard  deviations  are  near  1  when  the  odds  ratio  is  larger 
than  1,  but  decrease  as  the  odds  ratio  approaches  0.  I  will  attempt  to 
fit  these  points  empirically  and  see  if  the  fitted  parameter  values  pro¬ 
duce  a  power  curve  that  is  closer  to  ths  observed  power. 
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Table  5.2.  Sample  Means  and  Standard  Deviations  for  the  Score 
j|  Statistics  from  the  Simulation 

For  each  valne  of  the  odds  ratio,  the  table  contains  the  mean  and  standard 
g  deviation  for  the  corresponding  sample  of  400  simulated  score  statistics. 


Odds  Ratio 

Mean 

Standard  Deviation 

.10 

-3.1401 

.811 

.13 

-2.9747 

.854 

.16 

-2.7089 

.857 

.20 

-2.5578 

.891 

.25 

-2.2370 

.872 

.32 

-1.8835 

.891 

Because  the  sample  means  appear  to  be  an  odd  function  of  the  log  odds 
ratio  and  because  the  simple  linear  function  fits  well  for  odds  ratios 
near  1,  this  suggests  adding  a  cubic  term  in  log  A  least  squares  fit 
with  the  linear  term  constrained  to  3.1  '  (the  square  root  of  the  infor¬ 
mation)  leads  to  -0.065  as  the  coefficient  of  the  cubic  term.  This  is  the 
dotted  line  in  Figure  5.5,  and  it  appears  to  give  a  good  fit  to  the  sample 
means . 

The  sample  standard  deviations,  however,  appear  to  lie  on  two  lines:  one 
at  the  constant  value  1.0  for  f>l  and  the  other  with  a  positive  slope.  It 
seems  reasonable  to  assume  continuity,  so  I  fit  the  second  line  by  least 
squares  subject  to  the  constraint  that  it  pass  through  the  point  (1,1). 

The  estimated  slope  is  0.078.  This  is  the  dotted  line  in  Figure  5.6. 

Figure  5.7  shows  the  previous  power  curves  along  with  one  obtained  by 
using  parameters  given  by  this  empirical  fit.  The  new  curve  is  a  marked 
improvement!  it  misses  only  2  of  the  21  confidence  intervals. 

The  last  column  in  Table  5.1  also  shows  a  good  fit.  Only  the  score 
statistics  calculated  with  the  odds  ratios  equal  to  .16  and  10  produce  a 
normal  distribution  that  is  significantly  different  from  the  empirical 
distribution  at  the  0.05  level. 

Normal  probability  plots  of  the  samples  show  that  the  normality  assumption 
is  justified.  Figures  5.8  and  5.9  contain  the  plots  for  f=0.1  and  f=10, 
respectively.  The  points  seem  to  lie  on  a  straight  line.  Plots  for  other 


wer  as  a  Function  of  Odds  Ratio 


Odds  Ratio 


Figure  5.8.  Normal  Probability  Plot  for  Score  Statistics 
Generated  with  il>  =  0.10 
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▼aloes  of  the  odds  ratio  are  similar. 

To  see  if  the  empirical  fit  holds  more  generally,  I  performed  a  second 
simulation.  1  generated  {Xt)  via  the  relation  Xt  =  pXt_^  +  ut,  where  Cot) 
is  a  sequence  of  independent  standard  normal  random  variables,  p  =  0.5, 
and  Xq  =  0.  The  sample  size,  intercept,  and  slope  were  150,  -1,  and  0.5, 
respectively. 

The  results  are  summarized  in  Figures  5.10-5.12.  In  Figure  5.10  the 
dotted  line  is  the  curve 

p  =  S.391/2(log  f)  -  0.065  (5 .39/3 .10) 1/2 (log  f)3, 
where  5.39  is  the  information  number  in  the  second  simulation  and  3.10  is 
the  information  number  in  the  original  simulation.  The  asymptotic  line 
gives  an  excellent  fit  to  the  observed  means  when  flo.5,  unlike  the  pre¬ 
vious  case.  For  f<0.5  the  behavior  is  qualitatively  similar  to  that  found 
in  the  first  simulation,  but  the  empirical  fit  is  not  as  good. 

In  Figure  5.11  the  results  are  similar  to  those  found  earlier,  but  there 
seems  to  be  a  steeper  slope  for  qp<l  and  there  is  some  indication  that  when 
t>>l  the  standard  deviation  exceeds  1. 

Figure  5.12  contsins  a  plot  of  the  observed  power,  first  order  asymptotic 
power,  and  the  power  obtained  b>  the  empirical  fit.  The  fitted  power  may 
be  marginally  better  than  the  asymptotic  power  for  f<0.6,  but  the  asymp¬ 
totic  power  seems  better  elsewhere. 


5.12.  Second  Simulation: 
s  a  Function  of  Odds  Ratio 


Odds  ratio 


In  summary,  the  simple  first  order  approximation  to  the  distribution  of 
the  score  statistic  gives  a  power  function  that  is  not  significantly 
different  from  the  power  observed  in  a  random  sample  when  the  odds  ratio 
differs  from  one  by  a  factor  of  no  more  than  two  or  three.  For  more  ex¬ 
treme  odds  ratios  the  approximation  overestimates  the  power,  though  it 
might  be  considered  adequate  as  a  qualitative  description  of  the  power.  A 
higher  order  approximation  does  not  significantly  improve  the  agreement 
between  the  empirical  and  theoretical  power  curves. 

A  look  at  the  sample  of  score  statistics,  though,  shows  that  their  distri¬ 
bution  is  normal  but  with  parameters  that  vary  systematically  from  those 
predicted  in  the  asymptotic  approximation.  Fitting  a  smooth  curve  to 
these  parameters  allows  calculation  of  a  power  function  more  in  agreement 
with  the  observed  power.  This  empirical  fit  can  be  obtained  by  simulation 
for  any  given  set  of  X's,  but  a  fit  obtained  for  one  particular  set  of  X's 
does  not  appear  to  be  valid  for  other  X's. 


Chapter  6 
Missing  Data 

In  this  chapter  I  will  examine  the  effect  of  missing  Y  values  on  the  score 
statistic  and  to  a  more  limited  extent  on  the  estimation  process.  I  will 
assume  that  the  corresponding  X  values  are  not  missing,  and  I  will  comment 
on  this  assumption  where  appropriate. 

6.1  Effect  on  the  score  test  for  independence 

In  this  chapter  I  will  examine  the  effect  of  missing  Y  values  on  the  score 
statistic  for  testing  independence  and  on  the  estimation  process. 

Recall  that  with  no  missing  data  the  log  likelihood  function  and  its 
derivatives  can  be  written 

L  =  log  PlY^yi]  +  I  Yt  log  n  +  (1-Yt)  log  (1-n), 

dL  _  y  Y-n  dn 

df>  *  n(l-n)  Bf  ‘ 


BL  _  y  Y-n  dn 

dp  1  n(l-n)  dp  ' 

where  n  =  P[Yt=l lYt-1] .  This  leads  to  a  score  function  for  f  that 
contains  products  of  consecutive  Y  values. 


Suppose  is  not  observed  but  *t-2*  an<*  2  ara  observed.  If  I 

define  nt>m  =  P[Yt=l|Yt_m]  and  nt(z)  =  P[Yt=l |Yt_2=z) ,  then  if  Yt=l  the 
contribution  to  the  log  likelihood  from  Yt  is 

log  nt#2  =  log (  P[Yt=l I Yt_1=l]  P[Yt_!=l |Yt_2] 

+  P[Yt=l|Yt_1=0]  P[It-l=°lTt-21  > 

=  log (  nt(l)  nt_i(Yt_2)  +  nt(0)  (l-n^CY^j) )  ), 

and  if  Yt=0  the  contribution  is  log(l-nt>2).  (Note  that  these  are  random 
variables;  they  depend  on  Yt_2.) 

Using  this  expression  for  the  likelihood,  the  score  test  can  still  be  per¬ 
formed  and  parameter  estimates  can  still  be  found  by  maximum  likelihood 
when  a  single  Y  value  is  missing.  If  two  or  more  consecutive  Y  values  are 
missing,  m  can  be  written  similarly  by  summing  over  all  possible  values 
of  Yt_2  through  Yt_m+2.  This  cannot  be  done,  however,  if  the  correspond¬ 
ing  X's  are  missing,  because  the  likelihood  is  a  function  of  all  these 


Suppose  first  that  only  Yt_2  is  missing.  The  contribution  of  the  tth  term 
of  the  likelihood  to  the  score  function  for  f  can  be  written 
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,  dn  (1)  dn  (Y  ,) 

=  t -  <Y  ,)  +  «  (1)  — ^  j--  t-~2 

3f  df  t-1  t-2  t  df» 


an.(O)  dn.  ,) 

+  — l -  <1-  *.,(!.-»  -  n  (0)  t-Z 

df>  t-1  t-2  t  df 


Evaluated  at  f=l  this  simplifies  to 


dn 


t,2 


df 


■  [,1"pt-llpt<1_pt,!  pt-l  *  Pt  [<It-2‘pt-2,pt-l(1"pt-l 


)] 


+  <1_pt-l)  '  pt  l(It-2“pt-2>pt-lll-pt-l)I 


=  0. 

Therefore  there  is  no  contribution  to  the  score  statitistic  due  to  the  de¬ 
pendence  between  Yt  and  Yt_2  **  ^t-1  is  biasing. 


This  is  also  true  if  more  than  one  observation  is  missing,  as  can  be  seen 
by  writing 


=  «*,<D  , _ „ (Y.  .  )  +  n.  ,(0)  <l-n_„  t_,(Y4_J)  . 


t  k  t  2  t-2  k-2  t~k  t  2 


t-2, k-2  t-k 


so  that 


dn 


t,k 


df 


d”t,2U) 


<Y^  J  +  „<1> 


dv  t-2, k-2'  t-k  t,2 


dnt-2,k-2(Yt-k) 

df 


2(0)  3nt-2  k-2(Yt-k) 

+  "If"  (1'nt-2,k-2(Yt-k))  -  "t.2(0)  -■■  'If-— 

The  first  factors  in  the  first  and  third  terms  are  zero  (from  above), 
while  the  second  and  fourth  terms  cancel,  since  2^  under 


the  null  hypothesis 
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In  summary,  if  Yt_^  is  missing  there  is  no  contribution  to  the  score  sta¬ 
tistic  from  the  dependence  between  Yt_2  and  Yt.  The  only  positive  terms 
in  the  score  statistic  are  those  that  involve  consecutive  observations. 
The  reduced  dependence  between  Yt_^  and  Yt  cannot  be  measured  by  this 
statistic  for  any  k>l. 


If,  for  example,  every  second  observation  in  a  sample  were  missing,  it 
might  be  possible  to  reparametrize  and  use  0  =  f(|>)  as  the  measure  of  de¬ 
pendence,  where  f  is  such  that  f'(l)=0.  The  score  function  would  be 
(dL/d|»)/f  *  (f») ,  so  it  might  approach  a  finite  non-zero  value  as  |>  ap¬ 
proaches  1  for  a  suitably  chosen  f.  However  in  such  a  case  it  may  be 
simpler  to  assume  a  model  in  which  the  odds  ratio  between  Yt  and  Yt_2  is 
constant  for  all  t,  and  to  treat  the  problem  as  if  no  data  were  missing. 
This  is  a  different  model,  but  it  might  be  nearly  the  same  if  the  marginal 
probabilities  are  nearly  constant. 


In  the  more  likely  case  of  some  consecutive  observations  together  with 
some  separated  by  gaps,  however,  the  contributions  from  consecutive  obser¬ 
vations  are  infinitely  larger  than  the  contributions  from  observations 
separated  by  gaps.  Therefore  no  such  reparametrization  is  possible  in 
this  case. 


There  is  no  standard  procedure  for  modifying  the  Durbin-Vatson  statistic 
for  missing  values  when  testing  for  serial  correlation  in  a  least  squares 
regression.  Three  possible  modifications  are  given  by  Savin  and  White 


' --V'V  V->V  V-V'  -  - -/vW  ■  ■  . 


91 


(1978)  and  by  Honohan  and  McCarthy  (1982).  Two  of  these  are  similar  to 
the  score  statistic  presented  herei  they  omit  the  terms  in  the  statistic 
that  involve  missing  values.  The  other  simply  removes  any  missing  values 
and  treats  any  surrounding  observations  as  if  they  vere  taken  at  consecu¬ 
tive  times. 

6.2  Effect  on  estimation  of  the  odds  ratio 

It  is  still  possible  to  write  down  the  likelihood  function  in  the  presence 
of  missing  data,  so  the  maximum  likelihood  estimates  can  be  found.  Be¬ 
cause  there  is  no  simple  general  expression  for  the  derivatives  of  the  log 
likelihood,  however,  it  would  probably  be  easier  to  use  a  derivative-free 
maximization  procedure  in  this  case.  Using  Newton's  method  would  require 
calculating  and  programming  first  and  second  derivatives  for  every  "gap 
length”  observed  in  the  sample. 

It  seems  that  some  information  would  be  lost  if  n  observations  span  m>n 
time  periods,  in  comparison  with  the  information  in  n  consecutive  observa¬ 
tions.  I  will  show  below  that  this  is  usually,  but  not  always,  the  case. 

Suppose  the  coefficients  and  therefore  the  marginal  probabilities  (pt)  are 
known,  and  I  want  to  estimate  the  odds  ratio  9.  Then  the  contribution  to 
the  Fisher  information  for  estimating  f  can  be  calculated  for  an  observa¬ 
tion  both  when  the  previous  value  is  observed  and  when  it  is  missing.  The 
ratio  of  the  two  information  numbers  gives  the  asymptotic  relative  effi¬ 
ciency  of  the  two  estimators  and  provides  a  measure  of  the  information 
loss  caused  by  the  intervening  missing  observation.  These  information 
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muobers  are  calculated  below. 

Table  6.1  contains  this  ratio  under  the  following  conditions.  In  the  case 
of  consective  observations  I  assume  they  each  have  marginal  probability 
0.5  of  "success. ■  In  the  second  case  I  assume  the  same  two  random  vari¬ 
ables  are  separated  by  a  single  missing  value  whose  marginal  probability 
of  success  takes  values  between  0.5  and  0.95  in  increments  of  0.05.  In 
both  cases  I  use  odds  ratios  increasing  from  1/32  to  32  in  multiples  of  2. 

The  surprising  feature  in  this  table  is  the  appearance  of  ratios  larger 
than  1  for  pt_j  near  0.5  and  for  extreme  values  of  the  odds  ratio.  This 
indicates  that  for  those  parameter  values,  if  only  two  observations  can  be 
taken  it  is  advantageous  not  to  take  them  consecutively,  but  to  allow  an 
intervening  value  to  pass  unobserved.  When  p  is  very  large  two  consecu¬ 
tive  observations  take  the  same  value  with  a  very  high  probability,  so 
little  information  is  gained  about  the  value  of  9.  Skipping  an  observa¬ 
tion  reduces  the  dependence  and  provides  more  information. 

This  effect  is  more  pronounced  in  Table  6.2,  where  the  marginal  probabili¬ 
ties  of  the  observed  values  are  0.1.  Here  the  ratio  exceeds  1  for  certain 
values  of  the  other  marginal  probabilities  when  the  odds  ratio  is  as  high 
as  0.25.  For  these  values  of  the  odds  ratio,  however,  the  dependence  is 
not  simply  reduced  by  inserting  a  missing  observation,  its  direction  is 
also  changed. 

Suppose  a  statistician  is  able  to  take  a  fixed  number  of  observations  of  a 
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Table  6.1.  Asymptotic  Relative  Efficiency 
when  pt  =  0.5 


Asymptotic  relative  efficiency  of  the  maximum  likelihood  estimator  of  the 
odds  ratio  when  a  single  missing  valne  intervenes  between  two  observa¬ 
tions,  as  compared  with  two  consecutive  observations.  P  is  the  marginal 
probability  that  the  missing  variable  is  1.  The  observed  variables  all 
have  marginal  probability  0.5  of  success. 


Odds  Ratio 


p 

1 

2 

4 

8 

16 

32 

50 

.000 

.114 

.400 

.743 

1.059 

1.314 

55 

.000 

.112 

.388 

.708 

.974 

1.134 

60 

.000 

.104 

.353 

.612 

.767 

.761 

65 

.000 

.093 

.301 

.482 

.529 

.434 

70 

.000 

.078 

.239 

.346 

.328 

.225 

75 

.000 

.061 

.174 

.225 

.185 

.109 

80 

.000 

.043 

.114 

.131 

.094 

.049 

85 

.000 

.026 

.065 

.066 

.042 

.020 

90 

.000 

.013 

.028 

.026 

.015 

.007 

95 

.000 

.003 

.007 

.006 

.003 

.001 

P 

1 

1/2 

1/4 

1/8 

1/16 

1/32 

50 

.000 

.114 

.400 

.743 

1.059 

1.314 

55 

.000 

.112 

.388 

.708 

.974 

1.134 

60 

.000 

.104 

.353 

.612 

.767 

.761 

65 

.000 

.093 

.301 

.482 

.529 

.434 

70 

.000 

.078 

.239 

.346 

.328 

.225 

75 

.000 

.061 

.174 

.225 

.185 

.109 

80 

.000 

.043 

.114 

.131 

.094 

.049 

85 

.000 

.026 

.065 

.066 

.042 

.020 

90 

.000 

.013 

.028 

.026 

.015 

.007 

95 

.000 

.003 

.007 

.006 

.003 

.001 
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Table  6.2.  Asymptotic  Relative  Efficiency 
when  pt  =  0.1 


Asymptotic  relative  efficiency  of  tbe  maximum  likelihood  estimator  of  the 
odds  ratio  when  a  single  missing  valne  intervenes  between  two  observa¬ 
tions,  as  compared  with  two  consecutive  observations.  P  is  the  marginal 
probability  that  the  missing  variable  is  1.  The  observed  variables  all 
have  marginal  probability  0.1  of  success. 


Odds  Ratio 


p 

1 

2 

4 

8 

16 

32 

.50 

.000 

.061 

.085 

.054 

.024 

.009 

.55 

.000 

.051 

.062 

.036 

.015 

.005 

.60 

.000 

.041 

.044 

.023 

.009 

.003 

.65 

.000 

.031 

.030 

.015 

.005 

.002 

.70 

.000 

.023 

.020 

.009 

.003 

.001 

.75 

.000 

.016 

.013 

.005 

.002 

.001 

.80 

.000 

.010 

.007 

.003 

.001 

.000 

.85 

.000 

.006 

.004 

.001 

.000 

.000 

.90 

.000 

.002 

.001 

.001 

.000 

.000 

.95 

.000 

.001 

.000 

.000 

.roo 

.000 

P 

1 

1/2 
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binary  first-order  Markov  process  with  known  marginal  probabilities,  and 
his  object  is  to  estimate  the  odds  ratio  of  the  matrix  of  transition  pro¬ 
babilities.  The  calculations  used  in  creating  these  tables  could  be  used 
to  determine  an  optimal  sampling  scheme.  For  example,  if  the  marginal 
probabilities  were  all  0.5,  the  first  row  in  Table  6.1  indicates  that  more 
information  about  the  odds  ratio  is  obtained  by  observing  the  process  at 
times  1  and  3  than  at  times  1  and  2  if  f  2  16. 

These  tables  show  that  for  certain  sequences  of  marginal  probabilities,  if 
the  odds  ratio  is  known  a  priori  to  lie  in  a  certain  region  it  may  be 
profitable  to  observe  the  process  intermittently  rather  than  continuously. 
In  such  cases  calculation  of  the  above  ratio  could  allow  an  optimal  sam¬ 
pling  scheme  to  minimize  the  asymptotic  variance  of  the  estimate  of  the 
odds  ratio.  However  for  most  values  of  the  odds  ratio  and  marginal  proba¬ 
bilities  it  is  better  to  take  consecutive  observations.  In  these  tables 
and  in  all  others  I  examined,  the  ratio  is  always  less  than  0.21  for  odds 
ratios  within  a  factor  of  2  of  1.  It  is  unlikely  that  in  any  realistic 
application  the  evidence  presented  here  warrants  letting  values  pass 
unobserved. 

This  phenomenon  was  observed  previously  in  the  case  of  stationary  first 
order  autoregressive  processes  with  known  mean  0  by  Dunsmuir  (1981).  (His 
univariate  continuous  model  is  analagous  to  the  binary  process  with  known 
constant  marginal  probabilities.)  His  results  can  be  used  to  show  that 
for  values  of  the  autoregressive  parameter  larger  than  3“*^,  it  is  more 
advantageous  to  skip  a  time  point  between  observations  than  to  take  two 


consecutive  observations  of  the  process. 

Dunsarair  goes  further  and  derives  formulas  that  can  be  used  to  compute  the 
asymptotic  relative  efficiencies  whenever  the  frequencies  of  gap  lengths 
are  given.  He  applies  these  results  to  two  types  of  sampling  schemes: 
Bernoulli  sampling,  where  the  series  is  observed  or  not  at  each  time  point 
according  to  a  sequence  of  independent  Bernoulli  random  variables,  and 
regular  A-B  sampling,  where  the  series  is  observed  according  to  a  repeat¬ 
ing  pattern  of  A  observations  and  B  missing  values. 

In  the  binary  case  the  asymptotic  relative  efficiency  could  be  calculated 
for  any  particular  pattern  of  regular  A-B  sampling.  Bernoulli  sampling, 
on  the  other  hand,  involves  random  unbounded  stretches  of  missing  observa¬ 
tions,  and  since  no  general  expression  for  the  information  as  a  function 
of  gap  length  seems  to  be  feasible,  the  asymptotic  relative  efficiency 
cannot  be  calculated  in  closed  form. 

In  the  remainder  of  this  section  I  will  calculate  the  Fisher  information 
only  for  the  two  cases  used  in  computing  the  ratios  in  Table  1:  consecu¬ 
tive  observations  and  observations  separated  by  a  single  missing  value. 

The  Fisher  information  for  consecutive  observations  is  given  in  Chapter  4, 
but  I  repeat  it  here  for  convenience. 

For  an  observation  Y2  preceded  by  another  observation  Y^,  the  contribu¬ 
tions  to  the  log  likelihood  and  its  derivatives  are 
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The  expectation  of  the  first  tern  in  the  second  derivative  is  0,  so  the 
Fisher  information  is 
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This  is  the  information  obtained  from  an  observation  when  the  preceding 
observation  is  not  missing. 

Now  suppose  a  single  unobserved  Y2  intervenes  between  two  observed  values 
Y2  and  Yg.  For  convenience  define  the  following  four  quantities  and  their 
derivatives : 
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Simplifying  these  derivatives  leads  to 
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Then  letting  A  =  p2(l-p2)»  the  contribution  to  the  likelihood  from  the 
term  can  be  written 
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The  expectation  of  the  second  factor  in  square  brackets  is  zero,  so  the 
Fisher  information  does  not  contain  the  derivative  of  the  first  term. 
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The  negative  of  the  latter  quantity  ia  the  expression  for  the  Fisher 
inforaation  that  was  used  in  conputing  Tables  6.1  and  6.2. 
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Chapter  7 
Graphics 

Autocorrelation  in  least  squares  regression  is  often  easily  detected  in  a 
plot.  If  the  coefficients  are  estimated  by  least  squares  and  the  esti¬ 
mates  are  used  to  compute  residuals  {rtJ,  then  a  plot  of  rt  as  a  function 
of  t  will  generally  show  any  serial  dependence 

Similar  plots  are  not  as  useful  in  the  serial  dependence  model.  One 
reason  for  this  is  that  the  residuals  in  some  sense  cannot  be  separated 
from  the  fitted  values  as  they  can  in  least  squares. 


1 


The  two  models  differ  in  the  constraints  placed  on  the  residuals.  In 
least  squares,  adding  any  fitted  value  to  any  residual  rt  produces  an 
acceptable  observation  yt  =  yt  +  rfc .  In  binary  regression  for  each  fitted 
probability  pt  there  are  only  two  possible  residuals.  l-pt  and  ~Pt> 

Another  difference  is  in  the  effect  on  joint  probabilities  of  the  marginal 
probabilities.  In  least  squares,  if  the  true  error  process  is  (et),  the 
probability  that  and  et_^  have  the  same  sign  is  a  function  only  of  the 
autocorrelation.  In  binary  regression  the  probability  that  yt~pt  and 
yt_2~pt_l  have  the  same  sign  depends  not  only  on  the  odds  ratio  but  also 
on  pt  and  pt_^.  If  both  siarginal  probabilities  are  close  to  one.  then  for 
some  ft  values  smaller  than  1  the  probability  that  the  two  errors  have  the 
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same  sign  is  larger  than  O.S. 


Figure  7.1  contains  three  plots  of  residuals  from  an  ordinary  logistic 
regression  as  a  function  of  tiae.  In  each  case  there  are  100  observations 
with  aarginal  probabilities  satisfying  log  (pt/(l-pt>)  -=  l+xt,  where  {x ^ } 
are  independent  standard  normal  random  variables.  I  chose  these  plots 
subjectively  as  typical  results  for  sequences  generated  using  y  =  4,  1, 
and  0.25.  The  values  of  the  score  statistics  for  testing  independence  are 
also  given  on  the  plot. 


The  residuals  used  in  the  plot  are  standardized  as  follows: 
1/4 _ yt~*t 


rt  *  “ 


T/4 


—1/2 

I  chose  this  scale  because  n  '  'trt-l  *  component  of  the  score  statis¬ 
tic.  This  will  be  more  important  in  Figure  7.2.  Here  it  does  not  change 
the  visual  impression  of  the  plots,  only  the  scale  markings  on  the  verti¬ 
cal  axis. 


At  first  glance  the  appearance  of  each  of  these  plots  resembles  that  of 
the  least  squares  residuals  from  a  process  with  negative  autocorrelation, 
since  the  lines  are  jagged  and  they  cross  the  axis  frequently.  However 
this  is  in  large  part  due  to  the  discreteness  of  the  residuals  rather  than 
their  serial  dependence. 

If  the  three  plots  are  compared,  some  features  become  apparent.  Vide 
hills  and  valleys  are  more  comaion  for  large  values  of  the  odds  ratio  than 


for  small  values.  This  is  especially  true  for  deep  valleys;  consecutive 
large  negative  residuals  occasionally  appear  when  p=4  (such  as  at  t=16  and 
28)  but  rarely  when  9=0.25.  Also  more  numerous  peaks  and  valleys  for  low 
values  of  9  give  the  impression  that  the  plots  are  more  "stretched”  hori¬ 
zontally  moving  from  the  bottom  plot  to  the  top. 

Unfortunately  none  of  these  visual  impressions  is  as  striking  as  the 
values  of  the  score  statistics  associated  with  each  plot. 

Because  the  score  statistic  is  so  useful,  it  can  be  presented  visually  by 
noting 

fl/2  =  V  _ 7t~Pt _  _ ^-l^t-l _ 

tf2  IIPt(l-lt)Pt_1d-Pt.1)31/4  I Ipt ( 1_Pt )Pt_i ( > ]  1/4 

-1/2  v 

1  rt  rt-l  ’ 

so  a  plot  of  rt  against  rt_2  may  be  useful.  Figure  7.2  contains  these 
plots  for  the  same  residual  vectors  used  in  Figure  7.1. 

There  are  several  features  in  these  plots  that  are  worth  exploring. 

First,  the  plot  for  9=4  appears  to  be  divided  into  four  clusters.  This 
phenomenon  occasionally  appears  in  this  type  of  plot  for  any  value  of  91 
it  is  caused  more  by  the  marginal  probabilities  than  by  the  value  of  9. 

If  there  are  few  small  marginal  probabilities,  then  there  will  be  few 
small  negative  residuals.  Therefore  the  scatter  plot  will  be  sparse  just 
below  the  horizontal  axis  and  just  to  the  left  of  the  vertical  axis. 


A  second  feature  is  the  presence  of  many  points  near  the  origin  in  the 


Figure  7.2.  Scatter  Plot  of  Residuals  at  Adjacent  Times 


first  quadrant.  This  again  depends  on  the  marginal  probabilities;  if 
there  are  several  consecutive  marginal  probabilities  near  1,  there  will  be 
several  small  positive  residuals.  If  the  odds  ratio  is  large,  though, 
this  tendency  will  be  more  pronounced,  as  in  Figure  7.2. 

A  third  feature,  perhaps  not  as  obvious  as  the  other  two,  is  that  there 
are  more  extreme  points  in  quadrants  I  and  III  if  j»>l  and  more  in  quad¬ 
rants  II  and  IV  if  yi<l.  (Here  I  define  "extreme”  points  as  those  with 
both  coordinates  far  from  0.)  It  is  this  feature  that  seems  to  be  a  good 
indicator  of  serial  dependence. 

Extreme  points  indicate  that  two  consecutive  observations  took  values  for 
which  the  marginal  probabilities  were  relatively  low.  The  odds  ratio, 
however,  is  a  measure  of  how  the  joint  probability  differs  from  the  prod¬ 
uct  of  the  joint  probabilities,  so  the  frequency  of  these  events  gives 
some  information  about  the  odds  ratio.  (In  Figure  7.2  there  are  no  ex¬ 
treme  points  in  quadrant  I  when  f=4,  but  this  is  not  typical  of  such 
plots . ) 

The  score  statistic  is  a  multiple  of  a  sum  of  the  products  of  consecutive 
standardized  residuals,  so  the  contribution  of  a  point  depends  on  the 
product  of  its  coordinates.  Therefore  in  interpreting  these  plots  it  is 
helpful  to  consider  each  point  in  relation  to  the  hyperbolas  defined  by 
constant  values  of  rtrt_j.  It  may  be  useful  to  superimpose  these  hyper¬ 


bolas  on  the  plot. 
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It  may  also  be  helpful  to  look  at  this  plot  as  a  pictorial  representation 
of  a  two-by-two  table.  If  £p t J  were  constant,  the  maximum  likelihood  es¬ 
timate  of  f  would  be 

•  _  (#  of  points  in  quadrant  I)  (#  of  points  in  quadrant  III) 


(#  of  points  in  quadrant  II)  (#  of  points  in  quadrant  IV) 

(The  points  would  also  appear  in  the  same  plotting  position.)  If  the  mar¬ 
ginal  probabilities  do  not  vary  a  great  deal,  this  relation  suggests  that 
counting  points  in  each  coordinate,  or  just  obtaining  some  visual  impres¬ 
sion  of  the  counts,  may  give  information  about 


In  summary,  residual  plots  for  the  logistic  regression  model  do  not  give 
the  strong  visual  indication  of  serial  dependence  that  they  give  in  ordi¬ 
nary  least  squares.  Information  about  serial  dependence  can  be  obtained 
through  inspection  of  these  plots,  but  the  most  striking  features  of  the 
plots  are  not  those  that  are  most  useful  in  detecting  serial  dependence. 
Experience  or  careful  study  is  needed  in  order  to  extract  the  desired 
information.  The  score  statistic  is  a  much  better  indicator  of  serial 
dependence . 


£ 
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Chapter  8 

Application  to  EKG  Data 

The  inspiration  for  the  serial  dependence  model  came  from  work  on  the 
automatic  classification  of  heart  beats  by  their  EKG  traces.  In  this 
chapter  I  give  a  brief  description  of  the  problem.  I  also  apply  some  of 
the  procedures  developed  in  this  paper  and  I  mention  some  of  the  diffi¬ 
culties  that  arise. 

8.1  Background 

In  this  section  I  give  a  brief  description  of  the  automatic  beat  classifi¬ 
cation  problem.  More  details  are  given  by  Ngwengwe  (1984). 

Examination  of  EKG  traces  can  give  valuable  information  about  the  likeli¬ 
hood  of  future  heart  problems.  Often  life-threatening  heart  trouble  such 
as  ventricular  fibrillation  is  preceeded  by  milder  arrhythmia.  Detection 
of  abnormalities  can  therefore  aid  the  physician  in  deciding  whether  pre¬ 
ventative  measures  are  required. 

Figure  8.1  shows  a  typical  normal  beat.  The  curve  is  the  electrical  po¬ 
tential  measured  between  two  electrodes  placed  on  the  patient's  chest. 

Each  beat  consists  of  a  small  P  wave,  a  larger  QRS  complex,  and  a  small  T 
wave.  A  physician  can  detect  abnormal  beats  such  as  premature  ventricular 


Figure  8.1.  Idealized  Normal  Beat.  This  figure  shows  the  components  of  a 


Figure  8.2.  Choice  of  Points  for  Ellipse.  The  solid  curve  is  the  magnitude  of 
the  signal.  Only  consecutive  points  around  the  peak  with  a  magnitude  at 


Points  included 
in  ellipse 


Baseline  =  0 
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contractions  (PVC's)  by  examining  the  EKG  trace  for  beats  that  differ  in 
soae  way  from  this  normal  form.  Ngwengwe  studied  the  use  of  automatic 
procedures  for  detecting  abnor  ulities  by  computer. 

This  study  used  the  MIT/BIH  database.  The  database  contains  EKG  measure¬ 
ments  on  forty-eight  patients,  each  about  thirty  minutes  in  length.  Each 
observation  consists  of  a  pair  of  measurements,  giving  the  electrical 
potential  between  two  pairs  of  electrodes  placed  on  the  chest  along 
roughly  perpendicular  axes.  Along  with  each  trace  is  a  set  of  beat  loca¬ 
tions  and  classifications  provided  by  a  cardiologist,  so  the  number  of 
beats  and  the  type  of  each  beat  can  be  considered  known  for  the  purpose  of 
this  work. 

Ngwengwe  carried  out  a  study  of  the  ability  of  various  features  of  the  EKG 
traces  to  discriminate  between  normal  heart  beats  and  premature  ventricu¬ 
lar  contractions.  Among  the  techniques  used  to  measure  the  power  of  each 
feature  as  a  discriminr ting  variable  were  linear  discriminant  analysis, 
recursive  partitioning,  and  logistic  regression. 

Some  of  the  best  features  were  suggested  by  a  simple  graphical  procedure. 
If  the  two  components  (or  channels)  of  the  EKG  measurement  are  plotted 
against  each  other  and  observed  over  time,  they  appear  to  trace  out  an 
ellipse.  The  appearance  of  the  ellipse  is  different  for  PVC's  than  for 
normal  beats.  Ngwengwe  found  that  features  associated  with  this  ellipse 
provided  excellent  discrimination  between  normal  beats  and  PVC's. 


Ill 


Figures  8.2  and  8.3  illustrate  the  procedure  Ngwengwe  used  to  obtain  the 
ellipse  features.  Between  each  pair  of  beats  is  a  relatively  long  stretch 
over  which  the  signal  is  roughly  constant.  Taking  the  median  signal  over 
a  long  range  therefore  provides  a  'baseline*  signal  that  can  be  subtracted 
from  the  entire  range.  This  can  be  done  for  both  channels.  This  provides 
an  origin,  and  the  distance  of  the  two  dimensional  measurement  from  this 
origin  is  plotted  in  Figure  8.2. 

The  bulk  of  the  apparent  ellipse  consists  of  points  in  the  time  range 
during  which  the  magnitude  of  the  signal  is  at  least  ten  percent  of  its 
peak  value.  This  is  the  cutoff  line  in  the  figure.  Only  points  in  this 
range  are  used  in  the  ellipse  calculation.  For  normal  beats,  this  range 
generally  includes  only  points  from  the  OKS  complex.  For  PVC's,  on  the 
other  hand,  this  criterion  may  cause  points  from  the  P  wave  or  the  T  wave 
to  be  included. 

Figure  8.3  is  a  typical  plot  of  the  two  components  of  the  EXG  trace.  One 
end  of  the  ellipse  contains  many  points,  including  those  along  the  base¬ 
line.  These  are  excluded  by  the  cutoff.  The  other  points  trace  out  the 
ellipse,  and  the  distance  between  these  points  generally  increases  as  the 
points  move  toward  the  opposite  end  of  the  ellipse. 

The  parameters  of  the  ellipse  can  be  estimated  by  computing  the  mean  vec¬ 
tor  and  covariance  matrix  for  the  points  outside  the  cutoff.  Because  of 
the  closer  spacing  at  one  end  of  ellipse,  it  is  necessary  to  weight  the 
points  according  to  a  scheme  described  by  Ngwengwe.  The  ellipse  then  con- 


Figure  8.3.  Ellipse  Parameters.  The  parameters  of  the  ellipse  are  estimated 
by  a  method  described  by  Ngwengwe  (1983).  Parameters  used  in  this  chapter 
are  the  number  of  points  in  the  ellipse  and  the  coordinates  of  the  center. 


Center  of  fitted 
ellipse 


Excluded  by 
10%  criterion 
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sists  of  those  points  with  s  Mahal snob  is  distance  froa  the  center  eqnal  to 
two. 

Ngwengwe  investigated  the  ability  of  the  five  ellipse  paraaeters  and  the 
nuaber  of  points  in  the  ellipse  to  discriainate  between  noraal  beats  and 
PVC's.  The  three  aeasnreaents  he  fonnd  aost  nsefnl  are  the  following: 
NPOINTS:  the  nnaber  of  points  in  the  ellipse,  a  aeasnre  of  the  width 

of  the  beat. 

XCENTER:  the  X  coordinate  of  the  center  of  the  ellipse,  a  aeasnre  of 
the  height  along  channel  1. 

YCENTER:  the  Y  coordinate  of  the  center  of  the  ellipse,  a  aeasnre  of 

the  height  along  channel  2. 

These  are  the  features  that  I  will  use  in  this  chapter. 

8.1  Logistic  regression 

Various  difficulties  occur  in  trying  to  fit  the  probability  of  a  PVC  as  a 
function  of  the  above  features  by  logistic  regression.  For  aany  patients 
it  is  possible  to  separate  the  noraal  beats  and  the  PVC's  by  a  hyperplane 
in  the  three  diaensional  feature  space.  In  some  cases  a  single  feature, 
usually  NPOINTS,  would  separate  the  two  beat  types.  This  is  a  sign  that 
the  features  work  well,  but  it  prevents  the  fitting  of  a  logistic 
regression.  I  will  refer  to  this  as  "perfect  separation.” 


One  assumption  Bade  in  logistic  regression  is  that  observations  are  inde¬ 
pendent  given  their  covariates.  This  does  not  seem  to  be  a  reasonable 
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assumption  here,  and  that  ia  the  subject  of  this  section.  I  will  apply 
the  techniques  used  in  this  paper  to  two  of  the  patients. 

The  difficulties  in  fitting  a  serial  dependence  model  exceed  those  in 
fitting  an  ordinary  logistic  regression.  Clearly  for  those  patients  with 
perfect  separation,  the  serial  dependence  model  cannot  be  fit.  Even  with¬ 
out  perfect  separation,  a  high  degree  of  serial  dependence  nay  lead  to  an 
infinite  odds  ratio  estiaate. 


Another  difficulty  is  a  consequence  of  the  excellent  discriaination  pro¬ 
vided  by  these  features.  In  aany  cases  the  fitted  probability  of  a  PVC  is 
near  zero  for  normal  beats  and  naar  one  for  PVC’a.  When  the  odds  ratio  is 
large  and  two  identical  beat  types  appear  in  a  row,  the  aarginal  proba¬ 
bility  of  the  observed  pair  Bay  be  very  close  to  one.  This  leads  to 
numerical  problems  in  the  maximua  likelihood  computations. 


Logistic  regression  for  patient  217  produces  the  following  estimates: 
Coefficient  Estimate  Standard Error 


Intercept 
NPOINTS 
XCENTER 
Y CENTER 


3.946 

-.1228 

-.0003469 

-.1009 


1.242 

.03121 

.003575 

.01487 


The  score  statistic  is  “  0.23419,  giving  a  one-step  odds  ratio  esti¬ 

mate  of  1.4850.  There  is  perfect  association,  so  the  maximum  likelihood 
estimate  of  the  odds  ratio  is  infinite. 


To  examine  the  effect  on  the  coefficient  estimates,  I  computed  restricted 
maximum  likelihood  estimates  of  the  coefficients  with  the  odds  ratio  set 
at  various  values.  The  results  are  as  follows: 


Odds  Ratio 

Intercept 

NPOINTS 

XCENTER 

YCENTER 

2 

3.911 

-.122 

-.00033 

-.1000 

4 

3.879 

-.121 

-.00032 

-.0979 

8 

3.898 

-.122 

-.00033 

-.0941 

16 

3.963 

-.125 

-.00034 

-.0883 

32 

3.922 

-.125 

-.00029 

-.0825 

64 

3.810 

-.122 

-.00023 

-.0778 

128 

3.677 

-.119 

-.00018 

-.0745 

256 

3.573 

-.116 

-.00018 

-.0723 

512 

3.490 

-.114 

-.00020 

-.0709 

Throughout  this  range  the  estimates  remain  within  one  half  of  a  standard 
error  of  the  logistic  estimates,  so  the  serial  dependence  does  not  seem  to 
have  had  any  adverse  consequence  on  the  coefficient  estimates. 

Patient  210  is  an  example  of  a  data  set  with  perfect  separation.  However 
if  NPOINTS  is  not  used  the  perfect  separation  disappears,  so  I  will  per¬ 
form  logistic  regression  using  only  the  other  two  features.  I  do  this  in 
order  to  illustrate  the  effect  of  serial  dependence,  but  of  course  NPOINTS 
is  the  best  feature  in  this  case  and  it  should  not  be  ignored  in  a  search 


for  the  best  model 


The  results  are  as  follows: 


Logistic 

Logistic 

Serial  Dep.  Model 

Coef.  Estimates 

Std  Errors 

Coef.  Estimatea 

Intercept 

-.315 

.272 

-.313 

XCENTER 

.0318 

.0023 

.0322 

Y CENTER 

-.0288 

.0039 

-.0276 

The  score  statistic  is  =  1.2979,  and  the  one-step  estimate  of  the 

odds  ratio  is  4.67.  The  maximum  likelihood  estimate  of  the  odds  ratio  is 
1.79.  Here  again  there  is  not  a  significant  difference  between  the  logis¬ 
tic  and  aiaximuai  likelihood  estimates,  as  aieasnred  by  comparison  with  the 
logistic  standard  errors. 


Chapter  9 


Stunmary 

In  thie  chapter  I  give  a  brief  summary  of  the  paper.  I  also  comment  on 
alternative  models  for  aerial  dependence  in  binary  regreasion  and  I  pre¬ 
sent  some  generalizations  of  the  serial  dependence  model. 

9.1  Summary 

In  this  paper  I  have  proposed  a  regression  model  for  binary  time  series, 
by  analogy  with  the  first  order  autoregressive  model  for  normal  time 
series.  Dependence  between  successive  observations  is  measured  by  the 
odds  ratio,  and  this  odds  ratio  is  assumed  constant  over  time.  The  pro¬ 
cess  has  a  Markov  property,  so  two  observations  are  independent  given  an 
intervening  observation.  The  marginal  probability  of  {1^=1}  is  a  logistic 
function  of  covariates.  In  the  special  case  of  independence,  the  odds 
ratio  is  equal  to  one,  and  the  model  is  equivalent  to  the  ordinary  logis¬ 
tic  model. 

With  this  model  calculations  of  quantities  involving  marginal  probabili¬ 
ties  are  quite  simple.  Calculations  of  joint  probabilities  are  more  com¬ 
plicated,  but  they  are  conventiently  done  by  defining  the  quantity  at  * 
P[Yt=Yt_j*l] ,  which  is  the  solution  of  a  quadratic  equation.  The  param¬ 
eters  of  a  log  linear  representation  for  joint  probabilities  can  be  re¬ 
lated  to  the  parameters  of  this  model,  and  the  interactions  terms  are 


simple  functions  of  the  odds  ratio.  A  simple  expression  gives  the  odds 
ratio  between  observations  that  are  not  adjacent.  A  crude  bound  on  the 
expression  proves  that  the  model  generates  ‘-mixing  sequences. 

If  a  logistic  model  is  fit  to  a  process  generated  by  the  serial  dependence 
model,  inferences  about  the  coefficients  are  suspect,  because  the  standard 
errors  of  the  estimates  are  not  correct.  However  the  closest  logistic 
model  to  any  serial  dependence  model  is  the  one  with  the  same  coefficients 
if  the  distance  is  the  Kullback-Leibler  distance  using  the  serial  depen¬ 
dence  model  as  the  true  model.  This  suggests  that  the  ordinary  logistic 
coefficient  estimates  ought  to  be  consistent. 

In  fact  the  maximum  likelihood  estimates  of  the  coefficients  and  the  odds 
ratio  are  consistent  under  certain  conditions,  and  a  simulation  shows  no 
significant  difference  between  the  logistic  and  the  maximum  likelihood  co¬ 
efficient  estimates.  The  maximum  likelihood  estimates  can  be  calculated 
by  Fisher's  scoring  method,  which  is  equivalent  to  Newton's  method  with 
the  second  derivative  replaced  by  its  expectation.  A  simple  estimator  for 
the  odds  ratio  is  obtained  by  computing  the  ordinary  logistic  coefficient 
estimates  and  the  score  statistic  for  independence,  and  by  finding  the 
odds  ratio  for  which  the  expected  value  of  that  statisic  is  the  value 
actually  observed.  This  estimate  performs  about  as  well  as  the  maximum 
likelihood  for  moderate  values  of  the  odds  ratio.  For  more  extreme  values 
is  underestimates  the  magnitude  of  the  log  odda  ratio,  but  it  avoids  the 
problem  of  infinite  estimates. 


The  score  statistic  for  testing  independence  is  simply  the  autocovariance 
or  its  square  root.  The  size  and  power  of  the  test  are  well  approximated 
by  the  values  given  by  large  sample  theory.  The  accnracy  of  these  approx¬ 
imations  decreases  as  the  odds  ratio  leaves  the  range  [0.4,  2 . S] .  A 
higher  order  approximation  does  not  isqirove  the  accnracy  outside  this 
range.  For  any  given  aet  of  covariates  an  empirical  approximation  to  the 
power  function  can  be  obtained,  but  it  is  not  valid  for  other  covariates. 
This  suggests  that  the  power  depends  on  more  than  just  the  information  and 
the  odds  ratio. 

If  observations  are  missing,  the  contribution  to  the  score  statistic  (for 
testing  independence)  given  by  the  dependence  between  observations  on 
either  side  of  the  gap  is  infinitesimal,  so  the  terms  in  the  statistic 
must  be  susused  over  consecutive  observations.  An  asymptotic  relative 
efficiency  calculation  shows  that  under  some  circumstances  a  statistician 
may  prefer  to  take  a  fixed  number  of  observations  spread  out  rather  than 
consecutively.  However  these  circumstances  are  not  likely  to  occur  in 
practical  calculations,  since  they  would  require  an  extreme  prior  estimate 
of  the  odds  ratio  and  they  might  require  advance  knowledge  of  the  marginal 
probabilities. 

Graphical  displays  of  the  residuals  do  not  give  vivid  demonstration  of 
serial  correlation  here  as  they  do  in  least  squares.  However  careful 
study  of  certain  plots  can  be  revealing. 

Fitting  of  this  model  to  features  obtained  from  the  MIT/BIH  database  of 
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EKG  traces  is  complicated  by  perfect  or  near  perfect  separation  of  the 
features  of  normal  beats  and  premature  ventricular  contractions.  In  one 
case  with  perfect  association,  restricted  maximum  likelihood  estimates  of 
the  coefficients  do  not  vary  a  great  deal  as  the  odds  ratio  is  set  at 
various  values.  In  another  case  maximum  likelihood  estimation  is  possible 
and  the  estimated  odds  ratio  is  1.79.  The  alternative  estimator  gives  a 
higher  value  of  4.67. 

9.2  Other  models 

Other  models  for  serial  dependence  are  possible.  A  direct  generalization 
of  the  least  squares  model  with  an  autoregressive  error  term  is  the 
following: 

log  (pt/(l-pt))  =  Xt'0  +  e t ,  et  =  PEt-i  +  ut, 

where  {ut)  are  indepenent  normal  random  variables  with  mean  zero  and 
common  unknown  variance.  However  this  is  no  longer  a  logistic  regression 
model  even  when  p=0 .  Under  this  model  the  sequence  (pt)  has  a  joint 
logistic-normal  distribution.  Some  properties  of  this  distribution  are 
given  by  Aitchison  and  Shen  (1980).  Simple  expressions  for  the  moments 
are  not  possible,  and  maximum  likelihood  estimation  of  this  model  is 
likely  to  be  quite  difficult. 

A  simpler  model  is  the  one  used  by  Korn  and  Vhittemore  (1919) t  ?t-l  ** 
included  as  an  explanatory  variable  for  Y(.  I  will  refer  to  this  model  as 
the  "lagged  dependent  variable"  model.  Under  this  model  the  conditional 
rather  than  the  marginal  probabilities  take  the  logistic  form.  Fitting 
the  model  is  quite  simple,  since  it  can  be  done  by  ordinary  logistic 


regression.  Calculation  of  conditional  probabilities  is  also  simple. 

As  in  the  serial  dependence  model,  the  odds  ratio  between  consecutive  ob¬ 
servations  is  constant.  Since  the  process  has  a  Markov  property  many  of 
the  results  from  Chapter  2  apply  to  this  model  as  well.  In  particular  the 
expression  for  the  odds  ratio  between  distant  observations  applies,  and 
this  model  also  generates  *-mixing  sequences.  It  is  interesting  to  note 
that  the  expressions  for  ^3  in  equations  [2.4]  and  [2.5]  involve  the 
quantity  <*2(1)*  a  P>rameter  in  the  log  linear  representation  for  the  joint 
probabilities  of  three  observations.  These  joint  probabilities  cannot  be 
determined  under  the  lagged  dependent  variable  model,  because  the  marginal 
probability  of  the  first  observation  is  unspecified.  However  the  condi¬ 
tional  probabilities  are  sufficient  to  calculate 

Any  proof  of  the  consistency  and  asymptotic  normality  of  maximum  likeli¬ 
hood  estimates  that  is  valid  for  longitudinal  data  is  not  likely  to  apply 
under  the  conditions  assumed  in  this  paper.  However  a  proof  of  consis¬ 
tency  along  the  lines  of  the  one  in  Chapter  4  may  be  possible  given  the 
•-mixing  property. 

Calculation  of  marginal  probabilities  is  difficult  under  the  lagged  depen¬ 
dent  variable  model.  In  particular  the  marginal  distribution  of  Yj  is  not 
determined  by  the  conditions  I  have  stated)  it  requires  a  specified  prior, 
or  an  assumed  value  or  prior  for  Yq.  Then  the  marginal  distribution  at 
each  time  is  determined.  Unfortunately  calculating  the  marginal  distribu¬ 
tion  at  time  t  requires  summing  over  all  possible  values  of  Ys,  s<t. 
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9.3  Generalizations 

The  aerial  dependence  model  consiats  of  two  componenta:  the  relationship 
between  the  covariates  and  the  marginal  probabilities,  and  the  aerial 
dependence.  Both  of  these  could  be  generalized. 

The  logistic  model  is  perhaps  the  most  common  binary  regression  model,  but 
other  models  are  possible.  Most  take  the  form  pt  =  F(Xt'0),  where  F(.) 
is  some  continuous  cumulative  distribution  function.  This  is  the  most 
general  model  for  which  pt  is  a  continuous  monotone  function  of  Xt'0  and 
for  which  pt  approaches  zero  or  one  as  Xt'{l  approaches  plus  or  minus 
infinity.  Common  choices  for  F,  aside  from  the  logistic,  are  the  normal, 
extreme  value,  and  uniform. 

The  analysis  here  could  be  repeated  for  these  models.  Some  of  the  re¬ 
sults,  such  as  the  ‘-mixing  property,  depend  only  on  the  dependence  and 
not  on  the  form  of  F.  These  results  apply  for  any  choice  of  F. 

The  odds  ratio  does  not  extend  to  higher  dimensions.  Bnt  from  Chapter  2, 
the  constant  odds  ratio  condition  can  be  restated  as  follows:  for  all  t, 
log  P[Yt_1=i,Yt=j]  =  u(t)  +  <-l)i+1  ux(t)  +  (-l)-*+1u2(t)  +  (-l)i+ju12, 
where  u22  is  not  a  function  of  t.  This  suggests  the  following  second  or¬ 
der  model,  with  the  plus  or  minus  sign  taken  as  appropriate  to  the  usual 
log  linear  model  convention: 
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Here  the  final  four  terms  are  not  functions  of  time.  It  may  make  sense  to 
require  n^2=u23'  *n<*  special  case  t»i23“®  Kay  slso  be  interesting. 
Higher  order  generalizations  can  be  defined  similarly. 
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Consequences  of  using  an  ordinary  logistic  model  in  the 
presence  of  serial  dependence  are  explored.  The  closest  logistic  model, 
defined  as  the  one  with  the  minimum  Kullback-Leibler  distance,  is  shown 
to  be  the  one  with  the  same  marginal  probabilities.  Consistency  of 
the  maximum  likelihood  estimator  of  the  serial  dependence  model  is 
proved  under  certain  conditions,  and  a  procedure  for  finding  these 
estimates  is  given. 

Properties  of  the  model  are  found,  including  expressions  for 
the  joint  probabilities  and  the  odds  ratio  between  observations 
separated  in  time.  The  model  is  shown  to  generate*-mixing  processes. 

A  score  test  is  derived  in  order  to  test  for  independence 
after  performing  an  ordinary  logistic  regression,  and  properties  of 
this  test  are  explored.  The  effects  of  missing  data  on  the  score  test 
and  on  estimation  of  the  odds  ratio  (with  known  coefficients)  are 
presented. 

The  model  is  applied  to  the  problem  of  automatic  classification 
of  EKG  data  based  on  feature  extraction.  A  positive  serial  dependence 
is  found  in  the  examples  presented. 
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